Introduction

Here, we will evaluate methods for deriving TWAS SNP-weights based on eQTL summary statistics.

There are two stages to this process:

  1. Determining which genes to derive TWAS SNP-weights for
  2. Derive SNP-weights predicting gene expression

Determining which genes to derive TWAS SNP-weights for

Existing approaches:

  • For SMR this is done by analysing any gene with a genome-wide significant eQTL
  • For FUSION this is done by analysing any gene with cis-SNPh2 p < 0.01
  • For MetaXcan this is done by analysing any gene with significant positive R2 using SNP-weights

The FUSION and MetaXcan approach is better for determining genes to include in a GWAS as it better reflects the power that TWAS would have. We may not always have a target sample to test the variance explained by the SNP-weights, so the MetaXcan approach is less applicable. So, I think we should use the same criteria as FUSION, except we must use a summary statistic based approach to estimate SNP-based heritability. We should compare summary statistic approaches to estimates from GREML as reported in FUSION released SNP-weights. The approach also needs to be fast as it will need to be implemented for each gene seperately.


Derive SNP-weights predicting gene expression

SMR uses single eQTL summary statistics, making it applicable to a wider range of datasets and results from meta-analyses. TWAS based methods are not currently relevent when only eQTL summary statistics are available, as the advantage of these methods over SMR is their ability to incorporate the effects of multiple SNPs on gene expression. Currently, TWAS methods use SNP-weights derived using individual level data, which means TWAS cannot use meta-analysis results for eQTLs, and often TWAS weights are unavailable for the latest eQTL datasets. So, here we will compare a range of summary statistic approaches to derive TWAS weights. Again, the method will need to be fast since it will need to be applied to each genes seperately. Summary statistic based polygenic scoring methods will likely be of use, though given the lack of target samples for validation, a pseudovalidation approach will be necessary, which estimates the optimal parameters without a target sample for validation.

Possible methods for deriving SNP-weights:

  • SBayesR
  • DBSLMM
  • PRScs auto
  • Mega-PRS
  • SuSiE

Methods


Comparison using GTEx v7 eQTL data and FUSION SNP-weights

Comparison to FUSION weights is not ideal as by chance different leads SNP could be selected. Ideally we would compare summary statistic-based and observed TWAS weights to observed expression. However, lets start with this as it will allow comparison of heritability estimates and be a good opportunity to design the study.

We will use whole blood data for the comparison. eQTL summary statistics are downloaded from here. FUSION TWAS SNP-weights were downloaded from here.


Compare SNP-h2 estimates

We will compare different version of GCTB Bayes models and LDSC. If LDSC doesn’t work, then LDAK model is unlikely to make a difference. To avoid unessecary compuational burden, only estimate hsq for 100 genes on chromosome 22.

Show code


# Split the GTEx eQTL data by chromosome
for chr in $(seq 1 22);do
pattern=$(echo ^${chr}_)
awk -v pat="$pattern" '$2 ~ pat { print }' /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/Whole_Blood.allpairs.txt > /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/Whole_Blood.allpairs.chr${chr}.txt_tmp
cat <(head -n1 /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/Whole_Blood.allpairs.txt) /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/Whole_Blood.allpairs.chr${chr}.txt_tmp > /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/Whole_Blood.allpairs.chr${chr}.txt
rm /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/Whole_Blood.allpairs.chr${chr}.txt_tmp
done
library(data.table)

# Make output directory
dir.create('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/SBayesR')
dir.create('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/SBayesS')
dir.create('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/LDSC')
dir.create('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/SBayesR/RDat_files')

# Read in HapMap3 SNP-list
hm3_snp<-fread('/users/k1806347/brc_scratch/Data/ldsc/w_hm3.snplist')

sbayesr_h2<-NULL
sbayesr_robust_h2<-NULL
sbayess_h2<-NULL
ldsc_h2<-NULL

# Run across each chromosome seperately
for(chr_i in 22){
  # Read in eQTL data
  eqtl<-fread(paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/Whole_Blood.allpairs.chr',chr_i,'.txt'))
  
  # Create CHR, BP, A1, A2, and Build columns
  var_id<-data.table(do.call(rbind, strsplit(eqtl$variant_id, '_')))
  names(var_id)<-c('CHR','BP','A1','A2','Build')
  var_id$CHR<-as.numeric(var_id$CHR)
  var_id$BP<-as.numeric(var_id$BP)
  
  eqtl<-cbind(eqtl, var_id)
  
  # Insert IUPAC codes for SNPs
  eqtl$IUPAC[eqtl$A1 == 'A' & eqtl$A2 =='T' | eqtl$A1 == 'T' & eqtl$A2 =='A']<-'W'
  eqtl$IUPAC[eqtl$A1 == 'C' & eqtl$A2 =='G' | eqtl$A1 == 'G' & eqtl$A2 =='C']<-'S'
  eqtl$IUPAC[eqtl$A1 == 'A' & eqtl$A2 =='G' | eqtl$A1 == 'G' & eqtl$A2 =='A']<-'R'
  eqtl$IUPAC[eqtl$A1 == 'C' & eqtl$A2 =='T' | eqtl$A1 == 'T' & eqtl$A2 =='C']<-'Y'
  eqtl$IUPAC[eqtl$A1 == 'G' & eqtl$A2 =='T' | eqtl$A1 == 'T' & eqtl$A2 =='G']<-'K'
  eqtl$IUPAC[eqtl$A1 == 'A' & eqtl$A2 =='C' | eqtl$A1 == 'C' & eqtl$A2 =='A']<-'M'
  
  # Remove INDELS
  eqtl<-eqtl[!is.na(eqtl$IUPAC),]
  
  # Calculate N for each association
  eqtl$N<-round((eqtl$ma_count/eqtl$maf)/2,1)
  
  # Check the build
  table(eqtl$Build) 
  # They are all b37
  
  ###
  # Insert RSIDs
  ###
  
  # Read in 1KG reference SNP data
  ref_bim_i<-fread(paste0('/mnt/lustre/groups/biomarkers-brc-mh/Reference_data/1KG_Phase3/PLINK/all_phase3.chr',chr_i,'.bim'))
  names(ref_bim_i)<-c('CHR','SNP','POS','BP','A1','A2')
  
  # Extract hm3 SNPs
  ref_bim_i<-ref_bim_i[ref_bim_i$SNP %in% hm3_snp$SNP,]
  
  # Insert IUPAC codes
  ref_bim_i$IUPAC[ref_bim_i$A1 == 'A' & ref_bim_i$A2 =='T' | ref_bim_i$A1 == 'T' & ref_bim_i$A2 =='A']<-'W'
  ref_bim_i$IUPAC[ref_bim_i$A1 == 'C' & ref_bim_i$A2 =='G' | ref_bim_i$A1 == 'G' & ref_bim_i$A2 =='C']<-'S'
  ref_bim_i$IUPAC[ref_bim_i$A1 == 'A' & ref_bim_i$A2 =='G' | ref_bim_i$A1 == 'G' & ref_bim_i$A2 =='A']<-'R'
  ref_bim_i$IUPAC[ref_bim_i$A1 == 'C' & ref_bim_i$A2 =='T' | ref_bim_i$A1 == 'T' & ref_bim_i$A2 =='C']<-'Y'
  ref_bim_i$IUPAC[ref_bim_i$A1 == 'G' & ref_bim_i$A2 =='T' | ref_bim_i$A1 == 'T' & ref_bim_i$A2 =='G']<-'K'
  ref_bim_i$IUPAC[ref_bim_i$A1 == 'A' & ref_bim_i$A2 =='C' | ref_bim_i$A1 == 'C' & ref_bim_i$A2 =='A']<-'M'

  # Merge target and reference based on position
  ref_target<-merge(eqtl, ref_bim_i, by=c('CHR','BP'))
  
  # I have used the GTEx reference data as well, and the same number is returned. 
  # Note: The A1 allele is the ref allele.
  # Small number of SNPs remaining highlights the limitation of using HapMap3 SNPs only.
  
  # Identify SNPs for which alleles need to be flipped
  flip_tmp<-ref_target[(ref_target$IUPAC.x == 'R' & ref_target$IUPAC.y == 'Y' | 
                          ref_target$IUPAC.x == 'Y' & ref_target$IUPAC.y == 'R' | 
                          ref_target$IUPAC.x == 'K' & ref_target$IUPAC.y == 'M' |
                          ref_target$IUPAC.x == 'M' & ref_target$IUPAC.y == 'K'),]
  
  # Idenitfy SNPs which match the reference alleles
  incl<-ref_target[ ref_target$IUPAC.x == 'R' & ref_target$IUPAC.y == 'R' | 
                      ref_target$IUPAC.x == 'Y' & ref_target$IUPAC.y == 'Y' | 
                      ref_target$IUPAC.x == 'K' & ref_target$IUPAC.y == 'K' |
                      ref_target$IUPAC.x == 'M' & ref_target$IUPAC.y == 'M' ]
  
  # If a SNP that needs to be flipped has a duplicate that is on the correct strand, remove it.
  flip<-flip_tmp[!(flip_tmp$SNP %in% incl$SNP)]
  
  # Flip alleles where necessary
  flip_tmp$A1_flipped<-flip_tmp$A1.x
  flip_tmp$A1_flipped[flip_tmp$A1.x == 'A']<-'T'
  flip_tmp$A1_flipped[flip_tmp$A1.x == 'T']<-'C'
  flip_tmp$A1_flipped[flip_tmp$A1.x == 'G']<-'C'
  flip_tmp$A1_flipped[flip_tmp$A1.x == 'C']<-'G'
  flip_tmp$A1.x<-flip_tmp$A1_flipped
  flip_tmp$A1_flipped<-NULL
  
  flip_tmp$A2_flipped<-flip_tmp$A2.x
  flip_tmp$A2_flipped[flip_tmp$A2.x == 'A']<-'T'
  flip_tmp$A2_flipped[flip_tmp$A2.x == 'T']<-'C'
  flip_tmp$A2_flipped[flip_tmp$A2.x == 'G']<-'C'
  flip_tmp$A2_flipped[flip_tmp$A2.x == 'C']<-'G'
  flip_tmp$A2.x<-flip_tmp$A2_flipped
  flip_tmp$A2_flipped<-NULL
  
  # Recombine and format
  eqtl_harm<-rbind(flip_tmp, incl)
  eqtl_harm<-eqtl_harm[,c('CHR','SNP','BP','A1.x','A2.x',"gene_id","variant_id","tss_distance","ma_samples","ma_count","maf","pval_nominal","slope","slope_se","N"), with=F]
  names(eqtl_harm)[names(eqtl_harm) == 'A1.x']<-'A1'
  names(eqtl_harm)[names(eqtl_harm) == 'A2.x']<-'A2'

  # Remove variants with MAF < 1%
  eqtl_harm<-eqtl_harm[eqtl_harm$maf >= 0.01,]
  
  eqtl<-eqtl_harm
  rm(eqtl_harm)
  
  # Identify unique genes 
  genes<-unique(eqtl$gene_id)
  
  # Subset the first 100 genes
  genes<-genes[1:100]

  # Run for each gene seperately
  for(gene_i in genes){
    print(which(genes == gene_i))

    eqtl_gene_i<-eqtl[eqtl$gene_id == gene_i,]
    
    #######
    # SBayesR
    #######
    
    # Format for SBayesR
    eqtl_gene_i_sbayesr<-eqtl_gene_i[,c('SNP','A1','A2','maf','slope','slope_se','pval_nominal','N'), with=F]
    names(eqtl_gene_i_sbayesr)<-c('SNP','A1','A2','freq','b','se','p','N')
    
    # write in SBayesR format
    fwrite(eqtl_gene_i_sbayesr, paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/SBayesR/',gene_i,'.txt'), sep=' ', na = "NA", quote=F)
    
    # Run SBayesR
    log<-system(paste0('/users/k1806347/brc_scratch/Software/gctb_2.03beta_Linux/gctb_203.sh --sbayes R --ldm /scratch/groups/ukbiobank/usr/ollie_pain/GenoPredPipe/GenoPred/GenoPredPipe/resources/data/gctb_ref/ukbEURu_hm3_shrunk_sparse/ukbEURu_hm3_v3_50k_chr',chr_i,'.ldm.sparse --pi 0.95,0.02,0.02,0.01 --gamma 0.0,0.01,0.1,1 --gwas-summary /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/SBayesR/',gene_i,'.txt --chain-length 10000 --exclude-mhc --burn-in 2000 --out-freq 1000 --out /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/SBayesR/',gene_i,'.SBayesR'), intern=T)
    
    # Run SBayesR
    log2<-system(paste0('/users/k1806347/brc_scratch/Software/gctb_2.03beta_Linux/gctb_203.sh --sbayes R --ldm /scratch/groups/ukbiobank/usr/ollie_pain/GenoPredPipe/GenoPred/GenoPredPipe/resources/data/gctb_ref/ukbEURu_hm3_shrunk_sparse/ukbEURu_hm3_v3_50k_chr',chr_i,'.ldm.sparse --pi 0.95,0.02,0.02,0.01 --gamma 0.0,0.01,0.1,1 --gwas-summary /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/SBayesR/',gene_i,'.txt --chain-length 10000 --robust --exclude-mhc --burn-in 2000 --out-freq 1000 --out /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/SBayesR/',gene_i,'.SBayesR_robust'), intern=T)
    
    #######
    # SBayesS
    #######
    
    # Run SBayesS
    log3<-system(paste0('/users/k1806347/brc_scratch/Software/gctb_2.03beta_Linux/gctb_203.sh --sbayes S --ldm /scratch/groups/ukbiobank/usr/ollie_pain/GenoPredPipe/GenoPred/GenoPredPipe/resources/data/gctb_ref/ukbEURu_hm3_shrunk_sparse/ukbEURu_hm3_v3_50k_chr',chr_i,'.ldm.sparse --pi 0.1 --hsq 0.5 --chain-length 25000 --burn-in 5000 --gwas-summary /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/SBayesR/',gene_i,'.txt --exclude-mhc --out-freq 1000 --out /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/SBayesS/',gene_i,'.SBayesS'), intern=T)
    
    #######
    # LDSC
    #######
    
    # Format for LDSC
    eqtl_gene_i_ldsc<-eqtl_gene_i[,c('SNP','A1','A2','slope','slope_se','N'), with=F]
    names(eqtl_gene_i_ldsc)<-c('SNP','A1','A2','BETA','SE','N')
    eqtl_gene_i_ldsc$Z<-eqtl_gene_i_ldsc$BETA/eqtl_gene_i_ldsc$SE
    eqtl_gene_i_ldsc<-eqtl_gene_i_ldsc[,c('SNP','A1','A2','Z','N'), with=F]
    
    # write in LDSC format
    fwrite(eqtl_gene_i_ldsc, paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/LDSC/',gene_i,'.txt'), sep=' ', na = "NA", quote=F)
    
    # Run LDSC
    log4<-system(paste0('/users/k1806347/brc_scratch/Software/ldsc.sh --h2 /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/LDSC/',gene_i,'.txt --ref-ld-chr /users/k1806347/brc_scratch/Data/ldsc/eur_w_ld_chr/ --w-ld-chr /users/k1806347/brc_scratch/Data/ldsc/eur_w_ld_chr/ --out /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/LDSC/',gene_i,'.h2'), intern=T)

    ########
    # Tabulate SNP-h2 results
    ########
    
    # Read SbayesR heritability result
    if(file.exists(paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/SBayesR/',gene_i,'.SBayesR.parRes'))){
      
      par_res_file_i<-fread(paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/SBayesR/',gene_i,'.SBayesR.parRes'))
      par_res_file_i<-par_res_file_i[par_res_file_i$V1 == 'hsq',2:3, with=F]
      par_res_file_i$P<-pnorm(-abs(par_res_file_i$Mean/par_res_file_i$SD))
      par_res_file_i$Gene<-gene_i
      par_res_file_i<-par_res_file_i[,c('Gene','Mean','SD','P'), with=F]
      names(par_res_file_i)<-c('gene','hsq','se','p')
      sbayesr_h2<-rbind(sbayesr_h2, par_res_file_i)
    }

    if(file.exists(paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/SBayesR/',gene_i,'.SBayesR_robust.parRes'))){
      
      par_res_file_i<-fread(paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/SBayesR/',gene_i,'.SBayesR_robust.parRes'))
      par_res_file_i<-par_res_file_i[par_res_file_i$V1 == 'hsq',2:3, with=F]
      par_res_file_i$P<-pnorm(-abs(par_res_file_i$Mean/par_res_file_i$SD))
      par_res_file_i$Gene<-gene_i
      par_res_file_i<-par_res_file_i[,c('Gene','Mean','SD','P'), with=F]
      names(par_res_file_i)<-c('gene','hsq','se','p')
      sbayesr_robust_h2<-rbind(sbayesr_robust_h2, par_res_file_i)
    }

    # Read SbayesS heritability result
    if(file.exists(paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/SBayesS/',gene_i,'.SBayesS.parRes'))){
      
      par_res_file_i<-fread(paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/SBayesS/',gene_i,'.SBayesS.parRes'))
      par_res_file_i<-par_res_file_i[par_res_file_i$V1 == 'hsq',2:3, with=F]
      par_res_file_i$P<-pnorm(-abs(par_res_file_i$Mean/par_res_file_i$SD))
      par_res_file_i$Gene<-gene_i
      par_res_file_i<-par_res_file_i[,c('Gene','Mean','SD','P'), with=F]
      names(par_res_file_i)<-c('gene','hsq','se','p')
      sbayess_h2<-rbind(sbayess_h2, par_res_file_i)
    }
    
    # Read LDSC heritability result
    if(file.exists(paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/LDSC/',gene_i,'.h2.log'))){
      
      par_res_file_i<-readLines(paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/LDSC/',gene_i,'.h2.log'))
      par_res_file_i<-gsub('.*: ','',par_res_file_i[grepl('Total Observed scale h2',par_res_file_i)])
      hsq<-as.numeric(gsub(' .*','',par_res_file_i))
      se<-as.numeric(gsub(")",'',gsub(".*\\(",'',par_res_file_i)))
      p<-pnorm(hsq/se, lower.tail = F)
      
      ldsc_h2<-rbind(ldsc_h2, data.frame(gene=gene_i,
                                         hsq=hsq,
                                         se=se,
                                         p=p))
      
    }
  }
}

# Compare hsq estimates across methods
fusion_files<-list.files(path='/scratch/groups/biomarkers-brc-mh/TWAS_resource/FUSION/SNP-weights/Whole_Blood/Whole_Blood/', pattern='.RDat')
fusion_files<-gsub('.wgt.RDat','',gsub('Whole_Blood.','',fusion_files))
fusion_files<-intersect(fusion_files, genes)

greml_h2<-NULL

for(genes_i in fusion_files){
  load(paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/FUSION/SNP-weights/Whole_Blood/Whole_Blood/Whole_Blood.',genes_i,'.wgt.RDat'))
  fusion<-data.frame(wgt.matrix)
  fusion$SNP<-row.names(fusion)
  greml<-c(hsq, hsq.pv)

  greml_h2<-rbind(greml_h2, data.frame(gene=genes_i,
                                         hsq=greml[1],
                                         se=greml[2],
                                         p=greml[3]))
}

# Combine results across methods
sbayesr_h2$method<-'sbayesr'
sbayesr_robust_h2$method<-'sbayesr_robust'
sbayess_h2$method<-'sbayess'
ldsc_h2$method<-'ldsc'
greml_h2$method<-'greml'

all_h2<-do.call(rbind, list(sbayesr_h2, 
                            sbayesr_robust_h2, 
                            sbayess_h2, 
                            ldsc_h2, 
                            greml_h2))

write.table(all_h2, '/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/h2_estimates.txt', col.names=T, row.names=F, quote=F)
# Plot the results
library(data.table)

all_h2<-fread('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/h2_estimates.txt')

library(UpSetR)

# Remove estimates which did not converge
all_h2<-all_h2[all_h2$se != 0,]

# Merge estimates an check correlation
all_h2_full_list<-split(all_h2[,c('gene','hsq'),with=F], f = all_h2$method)

for(i in names(all_h2_full_list)){
  names(all_h2_full_list[[which(names(all_h2_full_list) == i)]])<-c('gene',i)
}

all_h2_full_dat<-Reduce(function(...) merge(..., by='gene', all=T), all_h2_full_list)

library("ggplot2")
library("GGally")   

lowerfun <- function(data,mapping){
  ggplot(data = data, mapping = mapping)+
    geom_point() +
    geom_abline(intercept =0 , slope = 1, colour='red')
}  

png('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/h2_estimates.png', units = 'px', res=300, width=2000, height=2000)
ggpairs(all_h2_full_dat[,-1], lower = list(continuous = wrap(lowerfun)))
dev.off()

# Note. LDSC estimates are very inaccurate and do not correlated well with other methods. SBayesR correlates best wit GREML, and SBayesR and SBayesR with robust setting are highly correlated. The SBayesS method correlates less well with other methods, but this doesn't mean it is inaccurate as we don't know the truth.

# Check the number of genes that are significant in each method and how they overlap across methods.

# Extract significant estimates
all_h2_sig<-all_h2[all_h2$p < 0.01,]
n_gene<-data.frame(table(all_h2_sig$method))
names(n_gene)<-c('Method','N_genes')

# The number is similar across methods, but SBayesR has the highest.
all_h2_sig_list<-split(all_h2_sig[['gene']], f = all_h2_sig$method)

# Plot number of genes with valid and significant estimates
png('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/n_sig_hsq.png', units = 'px', res=300, width=1500, height=1000)
upset(fromList(all_h2_sig_list), nsets=10, order.by = "freq")
dev.off()
# Again LDSC is least concordant with other methods.

# Compare number overlapping with GREML list
all_h2_greml_sig_list<-all_h2_sig_list
for(i in names(all_h2_greml_sig_list)){
  tmp<-all_h2_greml_sig_list[[which(names(all_h2_greml_sig_list) == i)]]
  tmp<-tmp[tmp %in% all_h2_greml_sig_list[['greml']]]
  all_h2_greml_sig_list[[which(names(all_h2_greml_sig_list) == i)]]<-tmp
}

n_gene$GREML_overlap<-unlist(lapply(all_h2_greml_sig_list, length))

write.csv(n_gene, '/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/n_sig_hsq.csv', row.names=F)

# Note. SBayesR has the largest overlap with GREML. Given this, and that SBayesR finds the most significant genes, I think we should use SBayesR without robust parameteristation to estimate heritability.

Show SNP-h2 estimates across methods

LDSC estimates are very inaccurate and do not correlated well with other methods. SBayesR correlates best wit GREML, and SBayesR and SBayesR with robust setting are highly correlated. The SBayesS method correlates less well with other methods, but this doesn’t mean it is inaccurate as we don’t know the truth.

Show number of genes with significant SNP-h2

Method N_genes GREML_overlap
greml 39 39
ldsc 30 21
sbayesr 36 33
sbayesr_robust 33 27
sbayess 35 27

The number is similar across methods, but SBayesR has the highest.

SBayesR has the largest overlap with GREML. Given this, and that SBayesR finds the most significant genes, I think we should use SBayesR without robust parameteristation to estimate heritability.

Again LDSC is least concordant with other methods.


Compare methods for generating SNP-weights

Now we have decided which genes to create SNP-weights for, we will generate SNP-weights for predicting gene expression using a range of methods. Then we will compare these SNP-weights to those derived by FUSION. Again, to avoid computational burden, start by running for the first 100 genes on chromosome 22.

Show code

library(data.table)

# Make output directory
dir.create('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp')
dir.create('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/SBayesR')
dir.create('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/SBayesR_robust')
dir.create('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/DBSLMM')
dir.create('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/PRScs')
dir.create('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/SuSiE')
dir.create('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/RDat_files')

# Read in HapMap3 SNP-list
hm3_snp<-fread('/users/k1806347/brc_scratch/Data/ldsc/w_hm3.snplist')

sbayesr_h2<-NULL

# Run across each chromosome seperately
for(chr_i in 22){
  # Read in eQTL data
  eqtl<-fread(paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/Whole_Blood.allpairs.chr',chr_i,'.txt'))
  
  # Create CHR, BP, A1, A2, and Build columns
  var_id<-data.table(do.call(rbind, strsplit(eqtl$variant_id, '_')))
  names(var_id)<-c('CHR','BP','A1','A2','Build')
  var_id$CHR<-as.numeric(var_id$CHR)
  var_id$BP<-as.numeric(var_id$BP)
  
  eqtl<-cbind(eqtl, var_id)
  
  # Insert IUPAC codes for SNPs
  eqtl$IUPAC[eqtl$A1 == 'A' & eqtl$A2 =='T' | eqtl$A1 == 'T' & eqtl$A2 =='A']<-'W'
  eqtl$IUPAC[eqtl$A1 == 'C' & eqtl$A2 =='G' | eqtl$A1 == 'G' & eqtl$A2 =='C']<-'S'
  eqtl$IUPAC[eqtl$A1 == 'A' & eqtl$A2 =='G' | eqtl$A1 == 'G' & eqtl$A2 =='A']<-'R'
  eqtl$IUPAC[eqtl$A1 == 'C' & eqtl$A2 =='T' | eqtl$A1 == 'T' & eqtl$A2 =='C']<-'Y'
  eqtl$IUPAC[eqtl$A1 == 'G' & eqtl$A2 =='T' | eqtl$A1 == 'T' & eqtl$A2 =='G']<-'K'
  eqtl$IUPAC[eqtl$A1 == 'A' & eqtl$A2 =='C' | eqtl$A1 == 'C' & eqtl$A2 =='A']<-'M'
  
  # Remove INDELS
  eqtl<-eqtl[!is.na(eqtl$IUPAC),]
  
  # Calculate N for each association
  eqtl$N<-round((eqtl$ma_count/eqtl$maf)/2,1)
  
  # Check the build
  table(eqtl$Build) 
  # They are all b37
  
  ###
  # Insert RSIDs
  ###
  
  # Read in 1KG reference SNP data
  ref_bim_i<-fread(paste0('/mnt/lustre/groups/biomarkers-brc-mh/Reference_data/1KG_Phase3/PLINK/all_phase3.chr',chr_i,'.bim'))
  names(ref_bim_i)<-c('CHR','SNP','POS','BP','A1','A2')
  
  # Extract hm3 SNPs
  ref_bim_i<-ref_bim_i[ref_bim_i$SNP %in% hm3_snp$SNP,]
  
  # Insert IUPAC codes
  ref_bim_i$IUPAC[ref_bim_i$A1 == 'A' & ref_bim_i$A2 =='T' | ref_bim_i$A1 == 'T' & ref_bim_i$A2 =='A']<-'W'
  ref_bim_i$IUPAC[ref_bim_i$A1 == 'C' & ref_bim_i$A2 =='G' | ref_bim_i$A1 == 'G' & ref_bim_i$A2 =='C']<-'S'
  ref_bim_i$IUPAC[ref_bim_i$A1 == 'A' & ref_bim_i$A2 =='G' | ref_bim_i$A1 == 'G' & ref_bim_i$A2 =='A']<-'R'
  ref_bim_i$IUPAC[ref_bim_i$A1 == 'C' & ref_bim_i$A2 =='T' | ref_bim_i$A1 == 'T' & ref_bim_i$A2 =='C']<-'Y'
  ref_bim_i$IUPAC[ref_bim_i$A1 == 'G' & ref_bim_i$A2 =='T' | ref_bim_i$A1 == 'T' & ref_bim_i$A2 =='G']<-'K'
  ref_bim_i$IUPAC[ref_bim_i$A1 == 'A' & ref_bim_i$A2 =='C' | ref_bim_i$A1 == 'C' & ref_bim_i$A2 =='A']<-'M'

  # Merge target and reference based on position
  ref_target<-merge(eqtl, ref_bim_i, by=c('CHR','BP'))
  
  # I have used the GTEx reference data as well, and the same number is returned. 
  # Note: The A1 allele is the ref allele.
  # Small number of SNPs remaining highlights the limitation of using HapMap3 SNPs only.
  
  # Identify SNPs for which alleles need to be flipped
  flip_tmp<-ref_target[(ref_target$IUPAC.x == 'R' & ref_target$IUPAC.y == 'Y' | 
                          ref_target$IUPAC.x == 'Y' & ref_target$IUPAC.y == 'R' | 
                          ref_target$IUPAC.x == 'K' & ref_target$IUPAC.y == 'M' |
                          ref_target$IUPAC.x == 'M' & ref_target$IUPAC.y == 'K'),]
  
  # Idenitfy SNPs which match the reference alleles
  incl<-ref_target[ ref_target$IUPAC.x == 'R' & ref_target$IUPAC.y == 'R' | 
                      ref_target$IUPAC.x == 'Y' & ref_target$IUPAC.y == 'Y' | 
                      ref_target$IUPAC.x == 'K' & ref_target$IUPAC.y == 'K' |
                      ref_target$IUPAC.x == 'M' & ref_target$IUPAC.y == 'M' ]
  
  # If a SNP that needs to be flipped has a duplicate that is on the correct strand, remove it.
  flip<-flip_tmp[!(flip_tmp$SNP %in% incl$SNP)]
  
  # Flip alleles where necessary
  flip_tmp$A1_flipped<-flip_tmp$A1.x
  flip_tmp$A1_flipped[flip_tmp$A1.x == 'A']<-'T'
  flip_tmp$A1_flipped[flip_tmp$A1.x == 'T']<-'C'
  flip_tmp$A1_flipped[flip_tmp$A1.x == 'G']<-'C'
  flip_tmp$A1_flipped[flip_tmp$A1.x == 'C']<-'G'
  flip_tmp$A1.x<-flip_tmp$A1_flipped
  flip_tmp$A1_flipped<-NULL
  
  flip_tmp$A2_flipped<-flip_tmp$A2.x
  flip_tmp$A2_flipped[flip_tmp$A2.x == 'A']<-'T'
  flip_tmp$A2_flipped[flip_tmp$A2.x == 'T']<-'C'
  flip_tmp$A2_flipped[flip_tmp$A2.x == 'G']<-'C'
  flip_tmp$A2_flipped[flip_tmp$A2.x == 'C']<-'G'
  flip_tmp$A2.x<-flip_tmp$A2_flipped
  flip_tmp$A2_flipped<-NULL
  
  # Recombine and format
  eqtl_harm<-rbind(flip_tmp, incl)
  eqtl_harm<-eqtl_harm[,c('CHR','SNP','BP','A1.x','A2.x',"gene_id","variant_id","tss_distance","ma_samples","ma_count","maf","pval_nominal","slope","slope_se","N"), with=F]
  names(eqtl_harm)[names(eqtl_harm) == 'A1.x']<-'A1'
  names(eqtl_harm)[names(eqtl_harm) == 'A2.x']<-'A2'

  # Remove variants with MAF < 1%
  eqtl_harm<-eqtl_harm[eqtl_harm$maf >= 0.01,]
  
  eqtl<-eqtl_harm
  rm(eqtl_harm)
  
  # Identify unique genes 
  genes<-unique(eqtl$gene_id)
  genes<-genes[1:100]

  # Run for each gene seperately
  for(gene_i in genes){
    print(which(genes == gene_i))

    eqtl_gene_i<-eqtl[eqtl$gene_id == gene_i,]
    
    # Sort by chromosome and bp
    eqtl_gene_i<-eqtl_gene_i[order(eqtl_gene_i$CHR, eqtl_gene_i$BP),]

    # Filter SNPs to those with N > 80% of max(N)
    eqtl_gene_i<-eqtl_gene_i[eqtl_gene_i$N >= 0.8*max(eqtl_gene_i$N),]
    
    #######
    # SBayesR
    #######
    
    # Format for SBayesR
    eqtl_gene_i_sbayesr<-eqtl_gene_i[,c('SNP','A1','A2','maf','slope','slope_se','pval_nominal','N'), with=F]
    names(eqtl_gene_i_sbayesr)<-c('SNP','A1','A2','freq','b','se','p','N')
    
    # write in SBayesR format
    fwrite(eqtl_gene_i_sbayesr, paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/SBayesR/',gene_i,'.txt'), sep=' ', na = "NA", quote=F)
    
    # Run SBayesR
    log<-system(paste0('/users/k1806347/brc_scratch/Software/gctb_2.03beta_Linux/gctb_203.sh --sbayes R --ldm /scratch/groups/ukbiobank/usr/ollie_pain/GenoPredPipe/GenoPred/GenoPredPipe/resources/data/gctb_ref/ukbEURu_hm3_shrunk_sparse/ukbEURu_hm3_v3_50k_chr',chr_i,'.ldm.sparse --pi 0.95,0.02,0.02,0.01 --gamma 0.0,0.01,0.1,1 --gwas-summary /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/SBayesR/',gene_i,'.txt --chain-length 10000 --exclude-mhc --burn-in 2000 --out-freq 1000 --out /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/SBayesR/',gene_i,'.SBayesR'), intern=T)
    
    # Read SbayesR heritability result
    if(file.exists(paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/SBayesR/',gene_i,'.SBayesR.parRes'))){
      
      par_res_file_i<-fread(paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/SBayesR/',gene_i,'.SBayesR.parRes'))
      par_res_file_i<-par_res_file_i[par_res_file_i$V1 == 'hsq',2:3, with=F]
      par_res_file_i$P<-pnorm(-abs(par_res_file_i$Mean/par_res_file_i$SD))
      par_res_file_i$Gene<-gene_i
      par_res_file_i<-par_res_file_i[,c('Gene','Mean','SD','P'), with=F]
      names(par_res_file_i)<-c('gene','hsq','se','p')
      sbayesr_h2<-rbind(sbayesr_h2, par_res_file_i)
      
      # If SNP-h2 p < 0.01, generate SNP-weights
      if(par_res_file_i$p < 0.01){
        # Flip the effect of each method to match eqtl sumstats
        ref_tmp<-eqtl_gene_i[, c('SNP','A1','A2'), with=F]
        
        #############
        # SBayesR
        #############
        # SBayesR has already been run, so just read in the SNP-weights
        sbayesr_score<-fread(paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/SBayesR/',gene_i,'.SBayesR.snpRes'))
        sbayesr_score<-sbayesr_score[,c('Name','A1','A2','A1Effect'), with=F]
        names(sbayesr_score)<-c('SNP','A1','A2','BETA')
        
        # Flip effects so allele match eQTL sumstats
        sbayesr_score<-merge(ref_tmp, sbayesr_score, by='SNP', all=T)
        sbayesr_score$BETA[which(sbayesr_score$A1.x == sbayesr_score$A2.y)]<--sbayesr_score$BETA[which(sbayesr_score$A1.x == sbayesr_score$A2.y)]
        sbayesr_score<-sbayesr_score[,c('SNP','A1.x','A2.x','BETA'), with=F]
        names(sbayesr_score)<-c('SNP','A1','A2','BETA')
        
        # Sort score file according eqtl_gene_i
      sbayesr_score<-sbayesr_score[match(eqtl_gene_i$SNP, sbayesr_score$SNP),]

        #############
        # SBayesR robust
        #############
        # SBayesR format sumstats are already available
        # So just run SBayesR with robust setting
        
        log<-system(paste0('/users/k1806347/brc_scratch/Software/gctb_2.03beta_Linux/gctb_203.sh --sbayes R --ldm /scratch/groups/ukbiobank/usr/ollie_pain/GenoPredPipe/GenoPred/GenoPredPipe/resources/data/gctb_ref/ukbEURu_hm3_shrunk_sparse/ukbEURu_hm3_v3_50k_chr',chr_i,'.ldm.sparse --pi 0.95,0.02,0.02,0.01 --gamma 0.0,0.01,0.1,1 --gwas-summary /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/SBayesR/',gene_i,'.txt --robust --chain-length 10000 --exclude-mhc --burn-in 2000 --out-freq 1000 --out /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/SBayesR_robust/',gene_i,'.SBayesR'), intern=T)

        # Read in the results
        sbayesr_robust_score<-fread(paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/SBayesR_robust/',gene_i,'.SBayesR.snpRes'))
        sbayesr_robust_score<-sbayesr_robust_score[,c('Name','A1','A2','A1Effect'), with=F]
        names(sbayesr_robust_score)<-c('SNP','A1','A2','BETA')
        
        # Flip effects so allele match eQTL sumstats
        sbayesr_robust_score<-merge(ref_tmp, sbayesr_robust_score, by='SNP', all=T)
        sbayesr_robust_score$BETA[which(sbayesr_robust_score$A1.x == sbayesr_robust_score$A2.y)]<--sbayesr_robust_score$BETA[which(sbayesr_robust_score$A1.x == sbayesr_robust_score$A2.y)]
        sbayesr_robust_score<-sbayesr_robust_score[,c('SNP','A1.x','A2.x','BETA'), with=F]
        names(sbayesr_robust_score)<-c('SNP','A1','A2','BETA')

        # Sort score file according eqtl_gene_i
      sbayesr_robust_score<-sbayesr_robust_score[match(eqtl_gene_i$SNP, sbayesr_robust_score$SNP),]

        #############
        # DBSLMM
        #############
        
        # Convert to GEMMA format
        eqtl_gene_i_dbslmm<-eqtl_gene_i
        eqtl_gene_i_dbslmm$N_MISS<-max(eqtl_gene_i_dbslmm$N)-eqtl_gene_i_dbslmm$N
        eqtl_gene_i_dbslmm<-eqtl_gene_i_dbslmm[,c('CHR','SNP','BP','N_MISS','N','A1','A2','maf','slope','slope_se','pval_nominal'),with=F]
          names(eqtl_gene_i_dbslmm)<-c('chr','rs','ps','n_mis','n_obs','allele1','allele0','af','beta','se','p_wald')
        
        # Match allele1 and 0 with A1 and 2 in reference (DBSLMM calls this allele discrepancy)
        ref_bim_subset<-fread(paste0('/scratch/groups/biomarkers-brc-mh/Reference_data/1KG_Phase3/PLINK/EUR/EUR_phase3.MAF_001.chr',chr_i,'.bim'))
      
        GWAS_match<-merge(eqtl_gene_i_dbslmm, ref_bim_subset[,c('V2','V5','V6'),with=F], by.x=c('rs','allele1','allele0'), by.y=c('V2','V5','V6'))
        GWAS_switch<-merge(eqtl_gene_i_dbslmm, ref_bim_subset[,c('V2','V5','V6'),with=F], by.x=c('rs','allele1','allele0'), by.y=c('V2','V6','V5'))
        GWAS_switch$allele_tmp<-GWAS_switch$allele0
        GWAS_switch$allele0<-GWAS_switch$allele1
        GWAS_switch$allele1<-GWAS_switch$allele_tmp
        GWAS_switch$allele_tmp<-NULL
        GWAS_switch$beta<--GWAS_switch$beta
        GWAS_switch$af<-1-GWAS_switch$af
        GWAS<-rbind(GWAS_match, GWAS_switch)
        
        GWAS<-GWAS[order(GWAS$chr, GWAS$ps),]
        GWAS<-GWAS[,c('chr','rs','ps','n_mis','n_obs','allele1','allele0','af','beta','se','p_wald'),with=F]
        GWAS_N<-mean(GWAS$n_obs)
        nsnp<-nrow(GWAS)
        
        # Write out formatted sumstats
        fwrite(GWAS, paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/DBSLMM/',gene_i,'.DBSLMM'), sep='\t', col.names=F)

        # Run dbslmm
        system('chmod 777 /users/k1806347/brc_scratch/Software/DBSLMM/software/dbslmm')
        system(paste0('/users/k1806347/brc_scratch/Software/Rscript.sh /users/k1806347/brc_scratch/Software/DBSLMM/software/DBSLMM.R --plink /users/k1806347/brc_scratch/Software/plink1.9.sh --block /users/k1806347/brc_scratch/Data/LDetect/EUR/fourier_ls-chr',chr_i,'.bed --dbslmm /users/k1806347/brc_scratch/Software/DBSLMM/software/dbslmm --h2 ',par_res_file_i$hsq,' --ref /scratch/groups/biomarkers-brc-mh/Reference_data/1KG_Phase3/PLINK/EUR/EUR_phase3.MAF_001.chr',chr_i,' --summary /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/DBSLMM/',gene_i,'.DBSLMM --n ',round(GWAS_N,0),' --nsnp ',nsnp,' --outPath /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/DBSLMM/ --thread 1'))

        # Read in the results
        dbslmm_score<-fread(paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/DBSLMM/',gsub('\\..*','',gene_i),'.dbslmm.txt'))
        dbslmm_score<-dbslmm_score[,c(1,2,4), with=T]
        names(dbslmm_score)<-c('SNP','A1','BETA')

        # Flip effects so allele match eQTL sumstats
        dbslmm_score<-merge(ref_tmp, dbslmm_score, by='SNP', all=T)
        dbslmm_score$BETA[which(dbslmm_score$A1.x != dbslmm_score$A1.y)]<--dbslmm_score$BETA[which(dbslmm_score$A1.x != dbslmm_score$A1.y)]
        dbslmm_score<-dbslmm_score[,c('SNP','A1.x','A2','BETA'), with=F]
        names(dbslmm_score)<-c('SNP','A1','A2','BETA')

        # Sort score file according eqtl_gene_i
      dbslmm_score<-dbslmm_score[match(eqtl_gene_i$SNP, dbslmm_score$SNP),]

        ##############
        # PRScs
        ##############
        
        # Format for PRScs
        eqtl_gene_i_prscs<-eqtl_gene_i[,c('SNP','A1','A2','slope','pval_nominal'), with=F]
        names(eqtl_gene_i_prscs)<-c('SNP','A1','A2','BETA','P')
        
        # write in PRScs format
        fwrite(eqtl_gene_i_prscs, paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/PRScs/',gene_i,'.txt'), sep=' ', na = "NA", quote=F)

        system(paste0('/users/k1806347/brc_scratch/Software/PRScs.sh --ref_dir=/users/k1806347/brc_scratch/Software/PRScs/ldblk_1kg_eur --bim_prefix=/scratch/groups/biomarkers-brc-mh/Reference_data/1KG_Phase3/PLINK/EUR/EUR_phase3.MAF_001.chr',chr_i,' --sst_file=/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/PRScs/',gene_i,'.txt --n_gwas=',round(GWAS_N,0),' --out_dir=/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/PRScs/',gene_i,' --chrom=',chr_i))

        # Read in the results
        prscs_score<-fread(paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/PRScs/',gene_i,'_pst_eff_a1_b0.5_phiauto_chr22.txt'))
        prscs_score<-prscs_score[,c('V2','V4','V6'), with=F]
        names(prscs_score)<-c('SNP','A1','BETA')

        # Flip effects so allele match eQTL sumstats
        prscs_score<-merge(ref_tmp, prscs_score, by='SNP', all=T)
        prscs_score$BETA[which(prscs_score$A1.x != prscs_score$A1.y)]<--prscs_score$BETA[which(prscs_score$A1.x != prscs_score$A1.y)]
        prscs_score<-prscs_score[,c('SNP','A1.x','A2','BETA'), with=F]
        names(prscs_score)<-c('SNP','A1','A2','BETA')
        
        # Sort score file according eqtl_gene_i
      prscs_score<-prscs_score[match(eqtl_gene_i$SNP, prscs_score$SNP),]

        # Note PRScs takes a lot longer than other methods.
        
        ################
        # SuSiE finemapping
        ################
        
        # Read LD estimates for eQTL sumstats
        write.table(eqtl_gene_i$SNP, paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/SuSiE/',gene_i,'_snps.txt'), col.names=F, row.names=F, quote=F)
        system(paste0('/users/k1806347/brc_scratch/Software/plink1.9.sh --bfile /scratch/groups/biomarkers-brc-mh/Reference_data/1KG_Phase3/PLINK/EUR/EUR_phase3.MAF_001.chr',chr_i,' --extract /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/SuSiE/',gene_i,'_snps.txt --r square --out /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/SuSiE/',gene_i))

        ld<-as.matrix(fread(paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/SuSiE/',gene_i,'.ld')))

        library(susieR)
        
        tryCatch(fitted_rss <- susie_rss(eqtl_gene_i$slope/eqtl_gene_i$slope_se, ld, L = 10), error = function(e){skip_to_next <<- TRUE})
        
        skip_to_next<-F
        if(skip_to_next == TRUE){
          susie_score<-data.table(SNP=eqtl_gene_i$SNP,
                                  A1=eqtl_gene_i$A1,
                                  BETA=NA)
        } else {
          susie_score<-data.table(SNP=eqtl_gene_i$SNP,
                        A1=eqtl_gene_i$A1,
                        BETA=eqtl_gene_i$slope*fitted_rss$pip)
        }
        
        # Flip effects so allele match eQTL sumstats
        susie_score<-merge(ref_tmp, susie_score, by='SNP', all=T)
        susie_score$BETA[which(susie_score$A1.x != susie_score$A1.y)]<--susie_score$BETA[which(susie_score$A1.x != susie_score$A1.y)]
        susie_score<-susie_score[,c('SNP','A1.x','A2','BETA'), with=F]
        names(susie_score)<-c('SNP','A1','A2','BETA')
        
        # Sort score file according eqtl_gene_i
      susie_score<-susie_score[match(eqtl_gene_i$SNP, susie_score$SNP),]

        # Create RDat file for FUSION
        cv.performance<-as.matrix(data.frame(sbayesr=c(NA,NA),
                                             sbayesr_robust=c(NA,NA),
                                             dbslmm=c(NA,NA),
                                             prscs=c(NA,NA),
                                             susie=c(NA,NA),
                                             top1=c(NA,NA), 
                                             row.names=c('rsq','pval')))
        
        # Sort eQTL data into the same order as other score files
        eqtl_gene_i<-eqtl_gene_i[match(sbayesr_score$SNP, eqtl_gene_i$SNP),]
        
        hsq<-c(par_res_file_i$hsq, par_res_file_i$se)
        hsq.pv<-par_res_file_i$p
        N.tot<-max(eqtl_gene_i$N)
        eqtl_gene_i$POS<-0
        snps<-eqtl_gene_i[,c('CHR','SNP','POS','BP','A1','A2')]
        names(snps)<-paste0('V',1:length(names(snps)))
        
        wgt.matrix<-as.matrix(data.frame(sbayesr=sbayesr_score$BETA,
                                         sbayesr_robust=sbayesr_robust_score$BETA,
                                         dbslmm=dbslmm_score$BETA,
                                         prscs=prscs_score$BETA,
                                         susie=susie_score$BETA,
                                         top1=eqtl_gene_i$slope,
                                         row.names=snps$V2))
        
        save(cv.performance, 
             hsq,
             hsq.pv,
             N.tot,
             snps,
             wgt.matrix,
             file = paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/RDat_files/',gene_i,'.RDat'))
      }
    }
  }
}
library(data.table)

fusion_pos<-fread('/scratch/groups/biomarkers-brc-mh/TWAS_resource/FUSION/SNP-weights/Whole_Blood/Whole_Blood.pos')
fusion_pos<-fusion_pos[fusion_pos$CHR == 22,]

dim(fusion_pos) # 234

eqtl_derived_files<-list.files(path='/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/RDat_files/', pattern='.RDat')
fusion_files<-list.files(path='/scratch/groups/biomarkers-brc-mh/TWAS_resource/FUSION/SNP-weights/Whole_Blood/Whole_Blood/', pattern='.RDat')

length(eqtl_derived_files) # 159
length(fusion_files) # 6006

eqtl_derived_files<-gsub('.RDat','',eqtl_derived_files)
fusion_files<-gsub('.wgt.RDat','',gsub('Whole_Blood.','',fusion_files))

genes<-intersect(eqtl_derived_files, fusion_files)
length(genes) # 33

nsnp_res<-NULL
hsq_res<-NULL
cor_res<-list()

for(genes_i in genes){
  load(paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/RDat_files/',genes_i,'.RDat'))
  eqtl_derived<-data.frame(wgt.matrix)
  if(sum(eqtl_derived$susie) == 0){
    eqtl_derived$susie<-NA
  }
  eqtl_derived$SNP<-row.names(eqtl_derived)
  sbayesr<-c(hsq, hsq.pv)
  
  load(paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/FUSION/SNP-weights/Whole_Blood/Whole_Blood/Whole_Blood.',genes_i,'.wgt.RDat'))
  fusion<-data.frame(wgt.matrix)
  fusion$SNP<-row.names(fusion)
  greml<-c(hsq, hsq.pv)
  
  nsnp_res<-rbind(nsnp_res, data.frame(gene_id=genes_i,
                                       nsnp_eqtl_derived=nrow(eqtl_derived),
                                       nsnp_fusion=nrow(fusion)))
  
  hsq_res<-rbind(hsq_res, data.frame(gene_id=genes_i,
                                     sbayesr_h2=sbayesr[1],
                                     sbayesr_se=sbayesr[2],
                                     sbayesr_p=sbayesr[3],
                                     greml_h2=greml[1],
                                     greml_se=greml[2],
                                     greml_p=greml[3]))

  both<-merge(eqtl_derived, fusion, by='SNP')
  
  if(nrow(both) > 0){
    cor_res[[genes_i]]<-cor(both[,-1], use='p')
  }
}

# Compare h2
cor(hsq_res$sbayesr_h2, hsq_res$greml_h2) # 0.7500682

library(ggplot2)
png('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/h2_comp.png', units='px', width=1000, height=1000, res=300)
ggplot(hsq_res, aes(x=sbayesr_h2, y=greml_h2)) +
  geom_abline(intercept =0 , slope = 1, colour='red') +
  geom_point() +
  coord_fixed() +
  xlim(0,1) +
  ylim(0,1)
dev.off()

# Compare nsnp
cor(nsnp_res$nsnp_eqtl_derived, nsnp_res$nsnp_fusion) # 0.4849533

library(ggplot2)
png('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/nsnp_comp.png', units='px', width=1000, height=1000, res=300)
ggplot(nsnp_res, aes(x=nsnp_eqtl_derived, y=nsnp_fusion)) +
  geom_abline(intercept =0 , slope = 1, colour='red') +
  geom_point() +
  coord_fixed()
dev.off()

# Check SNP-weight correlations
cor_res_melt<-melt(cor_res)
cor_res_melt$Var1<-gsub('top1.x','top1.GTEx',cor_res_melt$Var1)
cor_res_melt$Var2<-gsub('top1.x','top1.GTEx',cor_res_melt$Var2)
cor_res_melt$Var1<-gsub('top1.y','top1.FUSION',cor_res_melt$Var1)
cor_res_melt$Var2<-gsub('top1.y','top1.FUSION',cor_res_melt$Var2)
cor_res_melt$test<-paste0(cor_res_melt$Var1,'_',cor_res_melt$Var2)

cor_res_average<-NULL
for(i in unique(cor_res_melt$test)){
  cor_res_average<-rbind(cor_res_average, data.frame(test=i,
                                                     Mean=mean(cor_res_melt$value[cor_res_melt$test == i])))
}

png('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/cor_plot.png', units='px', width=3000, height=3000, res=300)
ggplot(cor_res_melt, aes(x=paste0(Var1,' vs. ',Var2), y=value)) +
  geom_boxplot() +
  coord_flip()
dev.off()

Show GREML and SBayesR SNP-h2 estimates

Show number of variants in FUSION and eQTL-based models

Show correlation between FUSION and eQTL-based models


The correlation between eQTL derived TWAS weights and FUSION derived TWAS weights is similar to correlation between different FUSION models.

The correlation between eQTL derived and FUSION weights is of interest, but not a good evaluation metric of methods, We should compare the correlation between predicted and observed expression, when using eQTL derived TWAS weights or largest FUSION TWAS weights.

The low correlation between topSNP results from FUSION and GTEx indicates comparison of FUSION and eQTL susmtat derived weights are not very informative.


Comparing predicted and observed gene expression in GTEx v8

Here we will predict gene expression levels in the GTEx v8 sample and then test the correlation with observered expression levels. Initially, we will use FUSION SNP-weights derived using the YFS whole blood sample, and eQTL data from eQTLGen whole blood meta-analysis (excl GTEx).

I do not have access to individual level data from GTeX v8, so this project is being carried out in collaboration with Zac Gerring. Here, I will prepare the SNP-weights and the code to predict gene expression levels for Zac. As an example target sample, I will use the EUR subset of the 1KG Phase 3 reference.

I have already written a script to predicted expression levels from TWAS weights called FeaturePred.


Generate SNP-weights from eQTLGen


Format eQTL sumstats

Show code

# Download the full cis-eQTL data exluding GTEx
# This was sent privately by Urmo Vosa
mkdir /users/k1806347/brc_scratch/Data/eQTLGen/excl_GTEx

# Extract relevent columns
zcat /users/k1806347/brc_scratch/Data/eQTLGen/excl_GTEx/2019-12-11-cis-eQTLsFDR-ProbeLevel-CohortInfoRemoved-WoGTEx.txt.gz | cut -f 1-5,9-11,13 | gzip > /users/k1806347/brc_scratch/Data/eQTLGen/excl_GTEx/2019-12-11-cis-eQTLsFDR-ProbeLevel-CohortInfoRemoved-WoGTEx.small.txt.gz
library(data.table)

# Read in relevent columns from the sumstats
sumstats<-fread('/users/k1806347/brc_scratch/Data/eQTLGen/excl_GTEx/2019-12-11-cis-eQTLsFDR-ProbeLevel-CohortInfoRemoved-WoGTEx.small.txt.gz')

# Extract data for each gene
genes<-unique(sumstats$ProbeName)

# Read in EUR MAF
frq<-NULL
for(i in 1:22){
  frq<-rbind(frq, fread(paste0('/mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/LDREF/1000G.EUR.',i,'.frq')))
}
names(frq)[names(frq) == 'MAF']<-'FREQ'

setkey(sumstats, ProbeName)

# Process eQTL sumstats for each gene
# For testing use only first 100 genes
dir.create('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen')
for(i in 1:length(genes)){
  print(i)
  
  tmp<-sumstats[.(genes[i])]
  
  # Create A1 and A2 columns and update column names
  tmp$A1<-gsub('.*/','',tmp$SNPType)
  tmp$A2<-gsub('/.*','',tmp$SNPType)
  tmp$A2[tmp$A1 != tmp$AlleleAssessed]<-gsub('.*/','',tmp$SNPType[tmp$A1 != tmp$AlleleAssessed])
  tmp$A1[tmp$A1 != tmp$AlleleAssessed]<-gsub('/.*','',tmp$SNPType[tmp$A1 != tmp$AlleleAssessed])

  tmp<-tmp[,c('ProbeName','SNPName','SNPChr','SNPChrPos', 'A1', 'A2','PValue','OverallZScore','SumNumberOfSamples'), with=F]
  names(tmp)<-c('Gene','SNP','CHR','BP','A1','A2','P','Z','N')

  # Insert FREQ from EUR reference
  # There don't seem to be any strand flips
  tmp_match<-merge(tmp, frq[frq$CHR == tmp$CHR[1],c('SNP','A1','A2','FREQ'),with=F], by=c('SNP','A1','A2'))
  tmp_flip<-merge(tmp, frq[frq$CHR == tmp$CHR[1],c('SNP','A1','A2','FREQ'),with=F], by.x=c('SNP','A1','A2'), by.y=c('SNP','A2','A1'))
  tmp_flip$FREQ<-1-tmp_flip$FREQ
  tmp<-rbind(tmp_match, tmp_flip)
  
  # Approximate BETA, SE, and P
  tmp$P<-2*pnorm(-abs(tmp$Z))
  tmp$BETA<-tmp$Z/sqrt((2*tmp$FREQ)*(1-tmp$FREQ)*(tmp$N+sqrt(abs(tmp$Z))))
  tmp$SE<-abs(tmp$BETA)/abs(tmp$Z)
  
  tmp$GENE<-genes[i]
    
  tmp<-tmp[,c('GENE','CHR','SNP','BP','A1','A2','BETA','SE','Z','P','FREQ','N'),with=F]
  
  if(i == 1){
    write.table(tmp, '/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen/eQTLGen_sumstats_exclGTeX.txt', col.names=T, row.names=F, quote=F)
  } else {
    write.table(tmp, '/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen/eQTLGen_sumstats_exclGTeX.txt', col.names=F, row.names=F, quote=F, append=T)
  }
}

Create TWAS weights

Show code


# Script modified to run only first 100 genes
sbatch -p brc,shared --mem 10G -n 1 /users/k1806347/brc_scratch/Software/Rscript.sh /users/k1806347/brc_scratch/Software/MyGit/eQTL_to_TWAS/compute_weights.R \
    --extract /users/k1806347/brc_scratch/Data/ldsc/w_hm3.snplist \
    --sumstats /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen/eQTLGen_sumstats_exclGTeX.txt \
    --gctb /users/k1806347/brc_scratch/Software/gctb_2.03beta_Linux/gctb_203.sh \
    --gctb_ref /scratch/groups/ukbiobank/usr/ollie_pain/GenoPredPipe/GenoPred/GenoPredPipe/resources/data/gctb_ref/ukbEURu_hm3_shrunk_sparse/ukbEURu_hm3_v3_50k_chr \
    --plink_ref_chr /users/k1806347/brc_scratch/Data/1KG/Phase3/1KGPhase3.w_hm3.chr \
    --ld_blocks /users/k1806347/brc_scratch/Data/LDetect/EUR \
  --rscript /users/k1806347/brc_scratch/Software/Rscript.sh \
  --dbslmm /users/k1806347/brc_scratch/Software/DBSLMM/software \
  --plink /users/k1806347/brc_scratch/Software/plink1.9.sh \
  --PRScs_path /users/k1806347/brc_scratch/Software/PRScs.sh \
  --PRScs_ref_path /users/k1806347/brc_scratch/Software/PRScs/ldblk_1kg_eur \
    --output /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen/twas_weights
# Rename RDat folder to match FUSION format weights
mv /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen/RDat_files /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen/eQTLGen.eQTL
# Create .pos file
library(data.table)

# Read in list of RDat files
rdat_list<-list.files(path='/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen/eQTLGen.eQTL', pattern='.RDat')

# Start making pos file
pos<-data.frame(PANEL='eQTLGen.eQTL',
                WGT=paste0('eQTLGen.eQTL/',rdat_list),
                ID=gsub('\\..*','',rdat_list))

# Insert CHR, P0 and P1 (GRCh=37)
library(biomaRt)
ensembl = useEnsembl(biomart="ensembl", dataset="hsapiens_gene_ensembl", GRCh=37)
Genes<-getBM(attributes=c('ensembl_gene_id_version','chromosome_name','start_position','end_position'), mart = ensembl)
Genes$ensembl_gene_id_version<-gsub('\\..*','',Genes$ensembl_gene_id_version)
names(Genes)<-c('ID','CHR','P0','P1')
Genes<-Genes[complete.cases(Genes),]
Genes<-Genes[!duplicated(Genes),]

pos<-merge(pos, Genes, all.x=T, by='ID')

# Read in each RDat file to retrieve N
for(i in 1:length(rdat_list)){
  load(paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen/eQTLGen.eQTL/',rdat_list[i]))
  pos$N[i]<-N.tot
}

pos<-pos[,c('PANEL','WGT','ID','CHR','P0','P1','N')]

write.table(pos, '/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen/eQTLGen.eQTL.pos', col.names=T, row.names=F, quote=F)

Create TWAS weights in parallel

Show code

# Run in parallel for first 100 genes

# Create file listing GWAS that haven't been processed.
cut -f 1 -d' ' /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen/eQTLGen_sumstats_exclGTeX.txt | tail -n +2 | uniq | head -n 100 > /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen2/gene_list.txt

> /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen2/todo.txt
for i in $(cat /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen2/gene_list.txt);do
  if [ ! -f /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen2/eQTLGen.eQTL/${i}.done ]; then
    echo ${i} >> /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen2/todo.txt
  fi
done

# Create shell script to run using sbatch
cat > /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen2/sbatch.sh << 'EOF'
#!/bin/sh

#SBATCH -p shared,brc
#SBATCH --mem 10G
#SBATCH -n 1
#SBATCH --nodes 1
#SBATCH -J eQTL_to_TWAS

ID=$(awk -v var="$SLURM_ARRAY_TASK_ID" 'NR == var {print $1}' /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen2/todo.txt)

/users/k1806347/brc_scratch/Software/Rscript.sh /users/k1806347/brc_scratch/Software/MyGit/eQTL_to_TWAS/compute_weights_V2.R \
    --id ${ID} \
    --extract /users/k1806347/brc_scratch/Data/ldsc/w_hm3.snplist \
    --sumstats /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen/eQTLGen_sumstats_exclGTeX.txt \
    --gctb /users/k1806347/brc_scratch/Software/gctb_2.03beta_Linux/gctb_203.sh \
    --gctb_ref /scratch/groups/ukbiobank/usr/ollie_pain/GenoPredPipe/GenoPred/GenoPredPipe/resources/data/gctb_ref/ukbEURu_hm3_shrunk_sparse/ukbEURu_hm3_v3_50k_chr \
    --plink_ref_chr /users/k1806347/brc_scratch/Data/1KG/Phase3/1KGPhase3.w_hm3.chr \
    --plink_ref_keep /users/k1806347/brc_scratch/Data/1KG/Phase3/keep_files/EUR_samples.keep \
    --ld_blocks /users/k1806347/brc_scratch/Data/LDetect/EUR \
  --rscript /users/k1806347/brc_scratch/Software/Rscript.sh \
  --dbslmm /users/k1806347/brc_scratch/Software/DBSLMM/software \
  --plink /users/k1806347/brc_scratch/Software/plink1.9.sh \
  --PRScs_path /users/k1806347/brc_scratch/Software/PRScs.sh \
  --PRScs_ref_path /users/k1806347/brc_scratch/Software/PRScs/ldblk_1kg_eur \
    --ldpred2_ref_dir /scratch/groups/ukbiobank/usr/ollie_pain/GenoPredPipe/GenoPred/GenoPredPipe/resources/data/ldpred2_ref \
--output /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen2/eQTLGen.eQTL

EOF

sbatch --array 1-$(wc -l /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen2/todo.txt | cut -d' ' -f1) /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen2/sbatch.sh
# Create .pos file
library(data.table)

# Read in list of RDat files
rdat_list<-list.files(path='/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen2/eQTLGen.eQTL', pattern='.RDat')

# Start making pos file
pos<-data.frame(PANEL='eQTLGen.eQTL',
                WGT=paste0('eQTLGen.eQTL/',rdat_list),
                ID=gsub('\\..*','',rdat_list))

# Insert CHR, P0 and P1 (GRCh=37)
library(biomaRt)
ensembl = useEnsembl(biomart="ensembl", dataset="hsapiens_gene_ensembl", GRCh=37)
Genes<-getBM(attributes=c('ensembl_gene_id_version','chromosome_name','start_position','end_position'), mart = ensembl)
Genes$ensembl_gene_id_version<-gsub('\\..*','',Genes$ensembl_gene_id_version)
names(Genes)<-c('ID','CHR','P0','P1')
Genes<-Genes[complete.cases(Genes),]
Genes<-Genes[!duplicated(Genes),]

pos<-merge(pos, Genes, all.x=T, by='ID')

# Read in each RDat file to retrieve N
for(i in 1:length(rdat_list)){
  load(paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen2/eQTLGen.eQTL/',rdat_list[i]))
  pos$N[i]<-N.tot
}

pos<-pos[,c('PANEL','WGT','ID','CHR','P0','P1','N')]

write.table(pos, '/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen2/eQTLGen.eQTL.pos', col.names=T, row.names=F, quote=F)

Create TWAS weights in parallel for all genes

Show code

# Create file listing GWAS that haven't been processed.
mkdir /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGenall
cut -f 1 -d' ' /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen/eQTLGen_sumstats_exclGTeX.txt | tail -n +2 | uniq > /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGenall/gene_list.txt

> /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGenall/todo.txt
for i in $(cat /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGenall/gene_list.txt);do
  if [ ! -f /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGenall/eQTLGen.eQTL/${i}.done ]; then
    echo ${i} >> /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGenall/todo.txt
  fi
done

# Create shell script to run using sbatch
cat > /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGenall/sbatch.sh << 'EOF'
#!/bin/sh

#SBATCH -p shared,brc
#SBATCH --mem 10G
#SBATCH -n 1
#SBATCH --nodes 1
#SBATCH -J eQTL_to_TWAS

ID=$(awk -v var="$SLURM_ARRAY_TASK_ID" 'NR == var {print $1}' /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGenall/todo.txt)

/users/k1806347/brc_scratch/Software/Rscript.sh /users/k1806347/brc_scratch/Software/MyGit/eQTL_to_TWAS/compute_weights_V2.R \
    --id ${ID} \
    --extract /users/k1806347/brc_scratch/Data/ldsc/w_hm3.snplist \
    --sumstats /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen/eQTLGen_sumstats_exclGTeX.txt \
    --gctb /users/k1806347/brc_scratch/Software/gctb_2.03beta_Linux/gctb_203.sh \
    --gctb_ref /scratch/groups/ukbiobank/usr/ollie_pain/GenoPredPipe/GenoPred/GenoPredPipe/resources/data/gctb_ref/ukbEURu_hm3_shrunk_sparse/ukbEURu_hm3_v3_50k_chr \
    --plink_ref_chr /users/k1806347/brc_scratch/Data/1KG/Phase3/1KGPhase3.w_hm3.chr \
    --plink_ref_keep /users/k1806347/brc_scratch/Data/1KG/Phase3/keep_files/EUR_samples.keep \
    --ld_blocks /users/k1806347/brc_scratch/Data/LDetect/EUR \
  --rscript /users/k1806347/brc_scratch/Software/Rscript.sh \
  --dbslmm /users/k1806347/brc_scratch/Software/DBSLMM/software \
  --plink /users/k1806347/brc_scratch/Software/plink1.9.sh \
  --PRScs_path /users/k1806347/brc_scratch/Software/PRScs.sh \
  --PRScs_ref_path /users/k1806347/brc_scratch/Software/PRScs/ldblk_1kg_eur \
    --ldpred2_ref_dir /scratch/groups/ukbiobank/usr/ollie_pain/GenoPredPipe/GenoPred/GenoPredPipe/resources/data/ldpred2_ref \
--output /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGenall/eQTLGen.eQTL

EOF

sbatch --array 1-$(wc -l /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGenall/todo.txt | cut -d' ' -f1)%200 /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGenall/sbatch.sh
# Create .pos file
library(data.table)

# Read in list of RDat files
rdat_list<-list.files(path='/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGenall/eQTLGen.eQTL', pattern='.RDat')

# Start making pos file
pos<-data.frame(PANEL='eQTLGen.eQTL',
                WGT=paste0('eQTLGen.eQTL/',rdat_list),
                ID=gsub('\\..*','',rdat_list))

# Insert CHR, P0 and P1 (GRCh=37)
library(biomaRt)
ensembl = useEnsembl(biomart="ensembl", dataset="hsapiens_gene_ensembl", GRCh=37)
Genes<-getBM(attributes=c('ensembl_gene_id_version','chromosome_name','start_position','end_position'), mart = ensembl)
Genes$ensembl_gene_id_version<-gsub('\\..*','',Genes$ensembl_gene_id_version)
names(Genes)<-c('ID','CHR','P0','P1')
Genes<-Genes[complete.cases(Genes),]
Genes<-Genes[!duplicated(Genes),]

pos<-merge(pos, Genes, all.x=T, by='ID')

# Read in each RDat file to retrieve N
for(i in 1:length(rdat_list)){
  load(paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGenall/eQTLGen.eQTL/',rdat_list[i]))
  pos$N[i]<-N.tot
}

pos<-pos[,c('PANEL','WGT','ID','CHR','P0','P1','N')]

write.table(pos, '/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGenall/eQTLGen.eQTL.pos', col.names=T, row.names=F, quote=F)

Generate SNP-weights from FUSION released YFS eQTL data

FUSION only provide the N of the sample and the Z of each SNP for each gene. We can convert this to the required information using reference MAF, though using these approximations is not ideal. This may not be the best comparison since they are finish and there may therefore be MAF and LD mismatch with target sample. Though I assume FUSION developers restricted the analysis to those os EUR ancestry.


Format eQTL sumstats

Show code

library(data.table)

# Read in the pos file
pos<-fread('/mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/SNP-weights/YFS.BLOOD.RNAARR/YFS.BLOOD.RNAARR.pos')

# Read in EUR MAF
frq<-NULL
for(i in 1:22){
  frq<-rbind(frq, fread(paste0('/mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/LDREF/1000G.EUR.',i,'.frq')))
}
names(frq)[names(frq) == 'MAF']<-'FREQ'

for(i in 1:nrow(pos)){
  print(i)
  
  load(paste0('/mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/SNP-weights/YFS.BLOOD.RNAARR/',pos$WGT[i]))
  tmp<-data.table(cbind(snps, wgt.matrix[,which(colnames(wgt.matrix) == 'top1')]))
  names(tmp)<-c('CHR','SNP','POS','BP','A1','A2','Z')
  tmp$N<-pos$N[i]
  
  # Insert EUR MAF
  tmp_match<-merge(tmp, frq[,c('SNP','A1','A2','FREQ'),with=F], by=c('SNP','A1','A2'))
  tmp_flip<-merge(tmp, frq[,c('SNP','A1','A2','FREQ'),with=F], by.x=c('SNP','A1','A2'), by.y=c('SNP','A2','A1'))
  tmp_flip$FREQ<-1-tmp_flip$FREQ
  tmp<-rbind(tmp_match, tmp_flip)
  
  # Approximate BETA, SE, and P
  tmp$P<-2*pnorm(-abs(tmp$Z))
  tmp$BETA<-tmp$Z/sqrt((2*tmp$FREQ)*(1-tmp$FREQ)*(tmp$N+sqrt(abs(tmp$Z))))
  tmp$SE<-abs(tmp$BETA)/abs(tmp$Z)
  
  tmp$GENE<-pos$ID[i]
    
  tmp<-tmp[,c('GENE','CHR','SNP','BP','A1','A2','BETA','SE','Z','P','FREQ','N'),with=F]
  
  if(i == 1){
    write.table(tmp, '/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS/YFS_eQTL_from_fusion.txt', col.names=T, row.names=F, quote=F)
  } else {
    write.table(tmp, '/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS/YFS_eQTL_from_fusion.txt', col.names=F, row.names=F, quote=F, append=T)
  }
}

# Make this the input format for eQTL_to_TWAS script.
# Incorporate DENTIST QC of sumstats?

Create TWAS weights

Show code


# Script modified to run only first 100 genes
sbatch -p brc,shared --mem 10G -n 1 /users/k1806347/brc_scratch/Software/Rscript.sh /users/k1806347/brc_scratch/Software/MyGit/eQTL_to_TWAS/compute_weights.R \
    --extract /users/k1806347/brc_scratch/Data/ldsc/w_hm3.snplist \
    --sumstats /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS/YFS_eQTL_from_fusion.txt \
    --gctb /users/k1806347/brc_scratch/Software/gctb_2.03beta_Linux/gctb_203.sh \
    --gctb_ref /scratch/groups/ukbiobank/usr/ollie_pain/GenoPredPipe/GenoPred/GenoPredPipe/resources/data/gctb_ref/ukbEURu_hm3_shrunk_sparse/ukbEURu_hm3_v3_50k_chr \
    --plink_ref_chr /users/k1806347/brc_scratch/Data/1KG/Phase3/1KGPhase3.w_hm3.chr \
    --ld_blocks /users/k1806347/brc_scratch/Data/LDetect/EUR \
  --rscript /users/k1806347/brc_scratch/Software/Rscript.sh \
  --dbslmm /users/k1806347/brc_scratch/Software/DBSLMM/software \
  --plink /users/k1806347/brc_scratch/Software/plink1.9.sh \
  --PRScs_path /users/k1806347/brc_scratch/Software/PRScs.sh \
  --PRScs_ref_path /users/k1806347/brc_scratch/Software/PRScs/ldblk_1kg_eur \
    --output /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS/twas_weights
    
# Rename RDat folder to match FUSION format weights
mv /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS/RDat_files /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS/YFS.eQTL
# Create .pos file
library(data.table)
pos<-fread('/mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/SNP-weights/YFS.BLOOD.RNAARR/YFS.BLOOD.RNAARR.pos')

# Read in list of RDat files
rdat_list<-list.files(path='/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS/YFS.eQTL', pattern='.RDat')

# Subset pos file to those with RDat file
pos<-pos[pos$ID %in% gsub('.RDat','',rdat_list),]

pos$PANEL<-'YFS.eQTL'
pos$WGT<-paste0(pos$ID,'.RDat')

write.table(pos, '/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS/YFS.eQTL.pos', col.names=T, row.names=F, quote=F)

Create TWAS weights in parallel

Show code

# Run in parallel for first 100 genes

# Create file listing GWAS that haven't been processed.
mkdir /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS2
cut -f 1 -d' ' /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS/YFS_eQTL_from_fusion.txt | tail -n +2 | uniq | head -n 100 > /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS2/gene_list.txt

> /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS2/todo.txt
for i in $(cat /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS2/gene_list.txt);do
  if [ ! -f /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS2/YFS.eQTL/${i}.done ]; then
    echo ${i} >> /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS2/todo.txt
  fi
done

# Create shell script to run using sbatch
cat > /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS2/sbatch.sh << 'EOF'
#!/bin/sh

#SBATCH -p shared,brc
#SBATCH --mem 10G
#SBATCH -n 1
#SBATCH --nodes 1
#SBATCH -J eQTL_to_TWAS

ID=$(awk -v var="$SLURM_ARRAY_TASK_ID" 'NR == var {print $1}' /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS2/todo.txt)

/users/k1806347/brc_scratch/Software/Rscript.sh /users/k1806347/brc_scratch/Software/MyGit/eQTL_to_TWAS/compute_weights_V2.R \
    --id ${ID} \
    --extract /users/k1806347/brc_scratch/Data/ldsc/w_hm3.snplist \
    --sumstats /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS/YFS_eQTL_from_fusion.txt \
    --gctb /users/k1806347/brc_scratch/Software/gctb_2.03beta_Linux/gctb_203.sh \
    --gctb_ref /scratch/groups/ukbiobank/usr/ollie_pain/GenoPredPipe/GenoPred/GenoPredPipe/resources/data/gctb_ref/ukbEURu_hm3_shrunk_sparse/ukbEURu_hm3_v3_50k_chr \
    --plink_ref_chr /users/k1806347/brc_scratch/Data/1KG/Phase3/1KGPhase3.w_hm3.chr \
    --plink_ref_keep /users/k1806347/brc_scratch/Data/1KG/Phase3/keep_files/EUR_samples.keep \
    --ld_blocks /users/k1806347/brc_scratch/Data/LDetect/EUR \
  --rscript /users/k1806347/brc_scratch/Software/Rscript.sh \
  --dbslmm /users/k1806347/brc_scratch/Software/DBSLMM/software \
  --plink /users/k1806347/brc_scratch/Software/plink1.9.sh \
  --PRScs_path /users/k1806347/brc_scratch/Software/PRScs.sh \
  --PRScs_ref_path /users/k1806347/brc_scratch/Software/PRScs/ldblk_1kg_eur \
    --ldpred2_ref_dir /scratch/groups/ukbiobank/usr/ollie_pain/GenoPredPipe/GenoPred/GenoPredPipe/resources/data/ldpred2_ref \
  --output /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS2/YFS.eQTL

EOF

sbatch --array 1-$(wc -l /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS2/todo.txt | cut -d' ' -f1) /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS2/sbatch.sh
# Create .pos file
library(data.table)

# Read in list of RDat files
rdat_list<-list.files(path='/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS2/YFS.eQTL', pattern='.RDat')

# Start making pos file
pos<-data.frame(PANEL='YFS.eQTL',
                WGT=paste0('YFS.eQTL/',rdat_list),
                ID=gsub('\\..*','',rdat_list))

# Insert CHR, P0 and P1 (GRCh=37)
library(biomaRt)
ensembl = useEnsembl(biomart="ensembl", dataset="hsapiens_gene_ensembl", GRCh=37)
Genes<-getBM(attributes=c('external_gene_name','chromosome_name','start_position','end_position'), mart = ensembl)
names(Genes)<-c('ID','CHR','P0','P1')
Genes<-Genes[complete.cases(Genes),]
Genes<-Genes[!duplicated(Genes),]

pos<-merge(pos, Genes, all.x=T, by='ID')
pos<-pos[order(pos$CHR),]
pos<-pos[!duplicated(pos$WGT),]

# Read in each RDat file to retrieve N
for(i in 1:length(rdat_list)){
  load(paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS2/YFS.eQTL/',rdat_list[i]))
  pos$N[i]<-N.tot
}

pos<-pos[,c('PANEL','WGT','ID','CHR','P0','P1','N')]

write.table(pos, '/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS2/YFS.eQTL.pos', col.names=T, row.names=F, quote=F)

Create TWAS weights in parallel for all genes

Show code

# Create file listing GWAS that haven't been processed.
mkdir /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFSall
cut -f 1 -d' ' /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS/YFS_eQTL_from_fusion.txt | tail -n +2 | uniq > /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFSall/gene_list.txt

> /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFSall/todo.txt
for i in $(cat /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFSall/gene_list.txt);do
  if [ ! -f /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFSall/YFS.eQTL/${i}.done ]; then
    echo ${i} >> /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFSall/todo.txt
  fi
done

# Create shell script to run using sbatch
cat > /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFSall/sbatch.sh << 'EOF'
#!/bin/sh

#SBATCH -p shared,brc
#SBATCH --mem 10G
#SBATCH -n 1
#SBATCH --nodes 1
#SBATCH -J eQTL_to_TWAS

ID=$(awk -v var="$SLURM_ARRAY_TASK_ID" 'NR == var {print $1}' /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFSall/todo.txt)

/users/k1806347/brc_scratch/Software/Rscript.sh /users/k1806347/brc_scratch/Software/MyGit/eQTL_to_TWAS/compute_weights_V2.R \
    --id ${ID} \
    --extract /users/k1806347/brc_scratch/Data/ldsc/w_hm3.snplist \
    --sumstats /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS/YFS_eQTL_from_fusion.txt \
    --gctb /users/k1806347/brc_scratch/Software/gctb_2.03beta_Linux/gctb_203.sh \
    --gctb_ref /scratch/groups/ukbiobank/usr/ollie_pain/GenoPredPipe/GenoPred/GenoPredPipe/resources/data/gctb_ref/ukbEURu_hm3_shrunk_sparse/ukbEURu_hm3_v3_50k_chr \
    --plink_ref_chr /users/k1806347/brc_scratch/Data/1KG/Phase3/1KGPhase3.w_hm3.chr \
    --plink_ref_keep /users/k1806347/brc_scratch/Data/1KG/Phase3/keep_files/EUR_samples.keep \
    --ld_blocks /users/k1806347/brc_scratch/Data/LDetect/EUR \
  --rscript /users/k1806347/brc_scratch/Software/Rscript.sh \
  --dbslmm /users/k1806347/brc_scratch/Software/DBSLMM/software \
  --plink /users/k1806347/brc_scratch/Software/plink1.9.sh \
  --PRScs_path /users/k1806347/brc_scratch/Software/PRScs.sh \
  --PRScs_ref_path /users/k1806347/brc_scratch/Software/PRScs/ldblk_1kg_eur \
    --ldpred2_ref_dir /scratch/groups/ukbiobank/usr/ollie_pain/GenoPredPipe/GenoPred/GenoPredPipe/resources/data/ldpred2_ref \
  --output /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFSall/YFS.eQTL

EOF

sbatch --array 1-$(wc -l /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFSall/todo.txt | cut -d' ' -f1)%50 /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFSall/sbatch.sh
# Create .pos file
library(data.table)

# Read in list of RDat files
rdat_list<-list.files(path='/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFSall/YFS.eQTL', pattern='.RDat')

# Start making pos file
pos<-data.frame(PANEL='YFS.eQTL',
                WGT=paste0('YFS.eQTL/',rdat_list),
                ID=gsub('\\..*','',rdat_list))

# Insert CHR, P0 and P1 (GRCh=37)
library(biomaRt)
ensembl = useEnsembl(biomart="ensembl", dataset="hsapiens_gene_ensembl", GRCh=37)
Genes<-getBM(attributes=c('external_gene_name','chromosome_name','start_position','end_position'), mart = ensembl)
names(Genes)<-c('ID','CHR','P0','P1')
Genes<-Genes[complete.cases(Genes),]
Genes<-Genes[!duplicated(Genes),]

pos<-merge(pos, Genes, all.x=T, by='ID')
pos<-pos[order(pos$CHR),]
pos<-pos[!duplicated(pos$WGT),]

# Read in each RDat file to retrieve N
for(i in 1:length(rdat_list)){
  load(paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFSall/YFS.eQTL/',rdat_list[i]))
  pos$N[i]<-N.tot
}

pos<-pos[,c('PANEL','WGT','ID','CHR','P0','P1','N')]

write.table(pos, '/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFSall/YFS.eQTL.pos', col.names=T, row.names=F, quote=F)

Predict expression in 1KG


FUSION YFS models

Show code

sbatch -p brc,shared --mem 10G -n 1 /users/k1806347/brc_scratch/Software/Rscript.sh /users/k1806347/brc_scratch/Software/MyGit/Predicting-TWAS-features/FeaturePred.V2.0.R \
    --PLINK_prefix_chr /mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/LDREF/1000G.EUR. \
    --weights /mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/SNP-weights/YFS.BLOOD.RNAARR/YFS.BLOOD.RNAARR.pos \
    --weights_dir /mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/SNP-weights/YFS.BLOOD.RNAARR \
    --ref_ld_chr /mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/LDREF/1000G.EUR. \
    --plink /users/k1806347/brc_scratch/Software/plink1.9.sh \
    --n_cores 1 \
    --save_score T \
    --save_ref_expr T \
    --memory 10000 \
    --all_mod T \
    --pigz /users/k1806347/brc_scratch/Software/pigz \
    --ref_maf /mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/LDREF/1000G.EUR. \
    --output /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/1kg_pred_exp/YFS.BLOOD.RNAARR

YFS eQTL-based models

Show code


sbatch -p brc,shared --mem 10G -n 1 /users/k1806347/brc_scratch/Software/Rscript.sh /users/k1806347/brc_scratch/Software/MyGit/Predicting-TWAS-features/FeaturePred.V2.0.R \
    --PLINK_prefix_chr /mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/LDREF/1000G.EUR. \
    --weights /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS/YFS.eQTL.pos \
    --weights_dir /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS \
    --ref_ld_chr /mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/LDREF/1000G.EUR. \
    --plink /users/k1806347/brc_scratch/Software/plink1.9.sh \
    --n_cores 1 \
    --save_score T \
    --save_ref_expr T \
    --memory 10000 \
    --all_mod T \
    --pigz /users/k1806347/brc_scratch/Software/pigz \
    --ref_maf /mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/LDREF/1000G.EUR. \
    --output /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/1kg_pred_exp/YFS.eQTL

YFS2 eQTL-based models

Show code


sbatch -p brc,shared --mem 10G -n 1 /users/k1806347/brc_scratch/Software/Rscript.sh /users/k1806347/brc_scratch/Software/MyGit/Predicting-TWAS-features/FeaturePred.V2.0.R \
    --PLINK_prefix_chr /mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/LDREF/1000G.EUR. \
    --weights /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS2/YFS.eQTL.pos \
    --weights_dir /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS2 \
    --ref_ld_chr /mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/LDREF/1000G.EUR. \
    --plink /users/k1806347/brc_scratch/Software/plink1.9.sh \
    --n_cores 1 \
    --save_score F \
    --save_ref_expr F \
    --memory 10000 \
    --all_mod T \
    --pigz /users/k1806347/brc_scratch/Software/pigz \
    --ref_maf /mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/LDREF/1000G.EUR. \
    --output /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/1kg_pred_exp/YFS2.eQTL

YFSall eQTL-based models

Show code


sbatch -p brc,shared --mem 30G -n 5 /users/k1806347/brc_scratch/Software/Rscript.sh /users/k1806347/brc_scratch/Software/MyGit/Predicting-TWAS-features/FeaturePred.V2.0.R \
    --PLINK_prefix_chr /mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/LDREF/1000G.EUR. \
    --weights /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFSall/YFS.eQTL.pos \
    --weights_dir /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFSall \
    --ref_ld_chr /mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/LDREF/1000G.EUR. \
    --plink /users/k1806347/brc_scratch/Software/plink1.9.sh \
    --n_cores 5 \
    --save_score F \
    --save_ref_expr F \
    --memory 10000 \
    --all_mod T \
    --pigz /users/k1806347/brc_scratch/Software/pigz \
    --ref_maf /mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/LDREF/1000G.EUR. \
    --output /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/1kg_pred_exp/YFSall.eQTL

eQTLGen eQTL-based models

Show code


sbatch -p brc,shared --mem 10G -n 1 /users/k1806347/brc_scratch/Software/Rscript.sh /users/k1806347/brc_scratch/Software/MyGit/Predicting-TWAS-features/FeaturePred.V2.0.R \
    --PLINK_prefix_chr /mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/LDREF/1000G.EUR. \
    --weights /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen/eQTLGen.eQTL.pos \
    --weights_dir /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen \
    --ref_ld_chr /mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/LDREF/1000G.EUR. \
    --plink /users/k1806347/brc_scratch/Software/plink1.9.sh \
    --n_cores 1 \
    --save_score T \
    --save_ref_expr T \
    --memory 10000 \
    --all_mod T \
    --pigz /users/k1806347/brc_scratch/Software/pigz \
    --ref_maf /mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/LDREF/1000G.EUR. \
    --output /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/1kg_pred_exp/eQTLGen.eQTL

eQTLGen2 eQTL-based models

Show code


sbatch -p brc,shared --mem 10G -n 1 /users/k1806347/brc_scratch/Software/Rscript.sh /users/k1806347/brc_scratch/Software/MyGit/Predicting-TWAS-features/FeaturePred.V2.0.R \
    --PLINK_prefix_chr /mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/LDREF/1000G.EUR. \
    --weights /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen2/eQTLGen.eQTL.pos \
    --weights_dir /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen2 \
    --ref_ld_chr /mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/LDREF/1000G.EUR. \
    --plink /users/k1806347/brc_scratch/Software/plink1.9.sh \
    --n_cores 1 \
    --save_score T \
    --save_ref_expr T \
    --memory 10000 \
    --all_mod T \
    --pigz /users/k1806347/brc_scratch/Software/pigz \
    --ref_maf /mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/LDREF/1000G.EUR. \
    --output /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/1kg_pred_exp/eQTLGen2.eQTL

eQTLGenall eQTL-based models

Show code


sbatch -p brc,shared --mem 30G -n 5 /users/k1806347/brc_scratch/Software/Rscript.sh /users/k1806347/brc_scratch/Software/MyGit/Predicting-TWAS-features/FeaturePred.V2.0.R \
    --PLINK_prefix_chr /mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/LDREF/1000G.EUR. \
    --weights /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGenall/eQTLGen.eQTL.pos \
    --weights_dir /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGenall \
    --ref_ld_chr /mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/LDREF/1000G.EUR. \
    --plink /users/k1806347/brc_scratch/Software/plink1.9.sh \
    --n_cores 5 \
    --save_score T \
    --save_ref_expr T \
    --memory 10000 \
    --all_mod T \
    --pigz /users/k1806347/brc_scratch/Software/pigz \
    --ref_maf /mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/LDREF/1000G.EUR. \
    --output /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/1kg_pred_exp/eQTLGenall.eQTL

Check correlation between predicted expression for YFS models

Show code

library(data.table)
fusion<-fread('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/1kg_pred_exp/YFS.BLOOD.RNAARR/FeaturePredictions_YFS.BLOOD.RNAARR.txt.gz')
eqtl<-fread('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/1kg_pred_exp/YFS.eQTL/FeaturePredictions_YFS.eQTL.txt.gz')

genes<-unique(gsub('\\..*','',gsub('YFS.eQTL.','',names(eqtl)[-1:-2])))

fusion<-fusion[,grepl(paste0('FID|IID|',paste(genes,collapse='|')),names(fusion)), with=F]
both<-merge(fusion, eqtl, by=c('FID','IID'))

cor_res<-list()
for(i in genes){
  tmp<-both[,grepl(paste0('\\.',i,'\\.'), names(both)), with=F]
  names(tmp)<-gsub('YFS.eQTL.RDat.','eqtl.',gsub('YFS.BLOOD.RNAARR.YFS.','twas.',gsub(paste0(i,'.'),'', names(tmp))))
  cor_res[[i]]<-cor(tmp, use='p')
}

# Check SNP-weight correlations
cor_res_melt<-melt(cor_res)
cor_res_melt$test<-paste0(cor_res_melt$Var1,'_',cor_res_melt$Var2)

cor_res_average<-NULL
for(i in unique(cor_res_melt$test)){
  cor_res_average<-rbind(cor_res_average, data.frame(test=i,
                                                     Mean=mean(cor_res_melt$value[cor_res_melt$test == i])))
}

library(ggplot2)
png('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/1kg_pred_exp/YFS_cor_plot.png', units='px', width=3000, height=3000, res=300)
ggplot(cor_res_melt, aes(x=paste0(Var1,' vs. ',Var2), y=value)) +
  geom_boxplot() +
  coord_flip()
dev.off()

# Looks as expected, except eQTL.top1 model doesn't correlate well with fusion.top1 model. In many instances the correlation is 1, but there are many low correlation. This seems to be due to changing the top1 Z score in fusion files to a BETA estimated based on N and MAF. In some instances this changes the top SNP to a SNP that is uncorrelated with the top SNP in the fusion files. Perhaps I should select the top SNP based on the p-value, rather than the BETA. I think selecting based on p-value may be more concordant with standard pT+clump PRS methodology, and will avoid SNPs with small N and large SE being selected. Thougb since we are filtering by N, this shouldn;t be a problem, and usin the SNP with the largest BETA may be more appropriate. However, the correlation between top1 and other models is lower when using the largest abs(BETA), suggesting using the most significant SNP may be more appropriate. This should be discussed with Sasha.

load('/mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/SNP-weights/YFS.BLOOD.RNAARR/YFS.BLOOD.RNAARR/YFS.TRNAU1AP.wgt.RDat')
fusion_wgt<-data.frame(wgt.matrix)
load('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS/YFS.eQTL/TRNAU1AP.RDat')
eqtl_wgt<-data.frame(wgt.matrix)

head(fusion_wgt[rev(order(abs(fusion_wgt$top1))),])
head(eqtl_wgt[rev(order(abs(eqtl_wgt$top1))),])

Show correlation between SNP-weights from each method

Looks as expected, except eQTL.top1 model doesn’t correlate well with fusion.top1 model. In many instances the correlation is 1, but there are many low correlation. This seems to be due to changing the top1 Z score in fusion files to a BETA estimated based on N and MAF. In some instances this changes the top SNP to a SNP that is uncorrelated with the top SNP in the fusion files. Perhaps I should select the top SNP based on the p-value, rather than the BETA. I think selecting based on p-value may be more concordant with standard pT+clump PRS methodology, and will avoid SNPs with small N and large SE being selected. Though since we are filtering by N, this shouldn’t be a problem, and using the SNP with the largest BETA may be more appropriate. However, the correlation between top1 and other models is lower when using the largest abs(BETA), suggesting using the most significant SNP may be more appropriate.


Check correlation between predicted expression for YFS2 models

Show code

library(data.table)
fusion<-fread('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/1kg_pred_exp/YFS.BLOOD.RNAARR/FeaturePredictions_YFS.BLOOD.RNAARR.txt.gz')
eqtl<-fread('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/1kg_pred_exp/YFS2.eQTL/FeaturePredictions_YFS.eQTL.txt.gz')

genes<-unique(gsub('\\..*','',gsub('YFS.eQTL.','',names(eqtl)[-1:-2])))

fusion<-fusion[,grepl(paste0('FID|IID|',paste(genes,collapse='|')),names(fusion)), with=F]
both<-merge(fusion, eqtl, by=c('FID','IID'))

cor_res<-list()
for(i in genes){
  tmp<-both[,grepl(paste0('\\.',i,'\\.'), names(both)), with=F]
  names(tmp)<-gsub('YFS.eQTL.RDat.','eqtl.',gsub('YFS.BLOOD.RNAARR.YFS.','twas.',gsub(paste0(i,'.'),'', names(tmp))))
  cor_res[[i]]<-cor(tmp, use='p')
}

# Check SNP-weight correlations
cor_res_melt<-melt(cor_res)
cor_res_melt$test<-paste0(cor_res_melt$Var1,'_',cor_res_melt$Var2)

cor_res_average<-NULL
for(i in unique(cor_res_melt$test)){
  cor_res_average<-rbind(cor_res_average, data.frame(test=i,
                                                     Mean=mean(cor_res_melt$value[cor_res_melt$test == i])))
}

library(ggplot2)
png('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/1kg_pred_exp/YFS2_cor_plot.png', units='px', width=3000, height=4000, res=300)
ggplot(cor_res_melt, aes(x=paste0(Var1,' vs. ',Var2), y=value)) +
  geom_boxplot() +
  coord_flip()
dev.off()

Show correlation between SNP-weights from each method

This is looking good. Everything is working as it should.


Predict expression in GTEx v8


This will be run by Zac. I will send him the score files and reference expression, and fusion LD reference data. I don’t know the file paths so I have just put file names.


For Zac


Prepare files for testing

Show code

mkdir -p /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/for_zac
cd /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/for_zac

# Copy FUSION reference with allele frequency files
cp -r /mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/LDREF ./

# Copy FUSION .pos, .Rdat files, score files, and reference expression
cd /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/for_zac
mkdir FUSION.YFS
cd FUSION.YFS

cp /scratch/groups/biomarkers-brc-mh/TWAS_resource/FUSION/SNP-weights/YFS.BLOOD.RNAARR/YFS.BLOOD.RNAARR.pos ./
cp -r /scratch/groups/biomarkers-brc-mh/TWAS_resource/FUSION/SNP-weights/YFS.BLOOD.RNAARR/YFS.BLOOD.RNAARR ./

# Copy eQTL-based .pos, .Rdat files, score files, and reference expression
cd /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/for_zac
mkdir eQTL.YFS
cd eQTL.YFS

cp /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS/YFS.eQTL.pos ./
cp -r /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS/YFS.eQTL ./

# Copy pigz binary
cd /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/for_zac
cp /users/k1806347/brc_scratch/Software/pigz ./

# Copy plink binary
cd /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/for_zac
cp /users/k1806347/brc_scratch/Software/plink1.9/plink ./

# Copy FeaturePred.V2.0.R
cd /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/for_zac
cp /users/k1806347/brc_scratch/Software/MyGit/Predicting-TWAS-features/FeaturePred.V2.0.R ./

cd /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS
tar -zcvf for_zac.tar.gz for_zac

##########
# Make a new folder containing updated YFS and eQTLGen TWAS weights
##########

mkdir -p /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/for_zac_26042022

# Copy YFS eQTL-based .pos, .Rdat files, score files, and reference expression
cd /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/for_zac_26042022
mkdir eQTL.YFS
cd eQTL.YFS

cp /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS/YFS.eQTL.pos ./
cp -r /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS/YFS.eQTL ./

# Copy eQTL-Gen eQTL-based .pos, .Rdat files, score files, and reference expression
cd /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/for_zac_26042022
mkdir eQTL.eQTLGen
cd eQTL.eQTLGen

cp /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen/eQTLGen.eQTL.pos ./
cp -r /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen/eQTLGen.eQTL ./

cd /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS
tar -zcvf for_zac_26042022.tar.gz for_zac_26042022

Prepare files for all genes

Show code

mkdir -p /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/for_zac_160622
cd /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/for_zac_160622

# Copy eQTL-based .pos, .Rdat files, score files, and reference expression
cd /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/for_zac_160622
mkdir eQTL.YFS
cd eQTL.YFS

cp /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFSall/YFS.eQTL.pos ./
cp -r /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFSall/YFS.eQTL ./

rm YFS.eQTL/*.done

# Copy eQTL-Gen eQTL-based .pos, .Rdat files, score files, and reference expression
cd /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/for_zac_160622
mkdir eQTL.eQTLGen
cd eQTL.eQTLGen

cp /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGenall/eQTLGen.eQTL.pos ./
cp -r /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGenall/eQTLGen.eQTL ./

rm eQTLGen.eQTL/*.done

cd /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS
tar -zcvf for_zac_160622.tar.gz for_zac_160622

FUSION YFS models

Show code

cd /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/for_zac

Rscript FeaturePred.V2.0.R \
    --PLINK_prefix_chr LDREF/1000G.EUR. \
    --weights FUSION.YFS/YFS.BLOOD.RNAARR.pos \
    --weights_dir FUSION.YFS \
    --ref_ld_chr LDREF/1000G.EUR. \
    --plink ./plink \
    --n_cores 1 \
    --memory 10000 \
    --all_mod T \
    --pigz ./pigz \
    --ref_maf LDREF/1000G.EUR. \
    --output test/FUSION.YFS

YFS eQTL-based models

Show code


cd /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/for_zac

/users/k1806347/brc_scratch/Software/Rscript.sh FeaturePred.V2.0.nodel2.R \
    --PLINK_prefix_chr LDREF/1000G.EUR. \
    --weights eQTL.YFS/YFS.eQTL.pos \
    --weights_dir eQTL.YFS \
    --ref_ld_chr LDREF/1000G.EUR. \
    --plink ./plink \
    --n_cores 1 \
    --memory 10000 \
    --all_mod T \
    --pigz ./pigz \
    --ref_maf LDREF/1000G.EUR. \
    --output test/eQTL.YFS

Compare predicted and observed expression in GTEx v8

Zac now sends me the predicted expression data for GTEx v8 so I can compare predicted and observed expression. Observed expression in GTEx is publicly available.

Here I am reading in observed and predicted expression values, testing their correlation, and then summarising the results across gene expression imputation methods. I am using the processed and normalised observed expression in GTEx. First I covary observed expression for covariates used in the original eQTL analysis by GTEx.


Evaluate FUSION and eQTL-based YFS models

Show code

library(data.table)

# Read in the observed expression
obs<-fread('/users/k1806347/brc_scratch/Data/GTeX/v8/GTEx_Analysis_v8_eQTL_expression_matrices/Whole_Blood.v8.normalized_expression.bed.gz')

# Insert external_gene_name
obs<-obs[,-1:-3]
obs$gene_id<-gsub('\\..*','',obs$gene_id)

library(biomaRt)
ensembl = useEnsembl(biomart="ensembl", dataset="hsapiens_gene_ensembl", GRCh=37)
Genes<-getBM(attributes=c('ensembl_gene_id_version','external_gene_name'), mart = ensembl)
Genes$ensembl_gene_id_version<-gsub('\\..*','',Genes$ensembl_gene_id_version)

obs<-merge(obs, Genes, by.x='gene_id', by.y='ensembl_gene_id_version')
obs<-obs[,c('external_gene_name',names(obs)[grepl('GTEX-', names(obs))]), with=F]
obs<-t(obs)
colnames(obs)<-obs[1,]
obs<-obs[-1,]
obs<-cbind(row.names(obs),obs)
colnames(obs)[1]<-'ID'
obs<-data.table(obs)
obs<-cbind(obs[,1],data.frame(lapply(obs[,-1], function(x) as.numeric(as.character(x)))))

# Read in covariates
covs<-fread('/users/k1806347/brc_scratch/Data/GTeX/v8/GTEx_Analysis_v8_eQTL_covariates/Whole_Blood.v8.covariates.txt')
covs<-t(covs)
colnames(covs)<-covs[1,]
covs<-covs[-1,]
covs<-cbind(row.names(covs),covs)
colnames(covs)[1]<-'ID'
covs<-data.table(covs)
covs<-cbind(covs[,1],data.frame(lapply(covs[,-1], function(x) as.numeric(as.character(x)))))

# Merge observed expression and covariates
obs<-obs[,!duplicated(names(obs)),with=F]
obs_covs<-merge(obs, covs, by='ID')

# Read in predicted expression
eqtl<-fread('/mnt/lustre/users/k1806347/Data/GTeX/v8/Zac/05072022/FeaturePredictions_YFS.eQTL.txt.gz')
names(eqtl)<-gsub('.RDat','',gsub('YFS.','', names(eqtl)))

fusion<-fread('/mnt/lustre/users/k1806347/Data/GTeX/v8/Zac/FUSION.YFS/FeaturePredictions_YFS.BLOOD.RNAARR.txt.gz')
names(fusion)<-gsub('YFS.BLOOD.RNAARR.YFS.','fusion.', names(fusion))

# Identify genes available in fusion and eqtl based models
obs_genes<-names(obs)[-1]
eqtl_genes<-unique(gsub('\\..*','',gsub('eQTL.','',names(eqtl)[-1:-2])))
fusion_genes<-unique(gsub('\\..*','',gsub('fusion.','',names(fusion)[-1:-2])))
both_genes<-intersect(obs_genes, eqtl_genes)
both_genes<-intersect(both_genes, fusion_genes)

# Residualise the covariates
obs_resid<-data.frame(ID=obs_covs$ID, stringsAsFactors=F)
for(i in both_genes){
  obs_resid[[i]]<-as.numeric(scale(resid(lm(as.formula(paste0(i,' ~ ', paste(names(covs)[-1], collapse=' + '))), data=obs_covs))))
}
obs_resid<-data.table(obs_resid)

# Merge eqtl and fusion predicted expression
pred_exp<-merge(eqtl, fusion, by=c('FID','IID'))

# Calculate correlation between observed expression and predicted expression from each method
cor_res<-list()
for(i in both_genes){
  # Rename columns to make output consistent across genes and label the observed expression
  pred_exp_tmp<-pred_exp[,grepl(paste0('FID$|IID$|\\.',i,'\\.'), names(pred_exp)), with=F]
  names(pred_exp_tmp)<-gsub(paste0('.',i), '', names(pred_exp_tmp))
  
  obs_exp_tmp<-obs_resid[,c('ID',i), with=F]
  names(obs_exp_tmp)[names(obs_exp_tmp) == i]<-'obs'
  
  # merge predicted and observed expression
  both_exp_tmp<-merge(pred_exp_tmp, obs_exp_tmp, by.x='IID', by.y='ID')
  
  both_exp_tmp$obs<-as.numeric(both_exp_tmp$obs)

  # Calculate correlation
  cor_res[[i]]<-cor(both_exp_tmp[,-1:-2,with=F], use='p')
}

# melt and combine all the results
cor_res_melt<-melt(cor_res)
cor_res_melt_obs<-cor_res_melt[cor_res_melt$Var1 == 'obs',]
cor_res_melt_obs<-cor_res_melt_obs[cor_res_melt_obs$Var1 != cor_res_melt_obs$Var2,]
cor_res_melt_obs$test<-paste0(cor_res_melt_obs$Var1,'_',cor_res_melt_obs$Var2)

write.table(cor_res_melt_obs, '/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v8_pred_exp/pred_obs_cor_YFS.txt', row.names=F, quote=F)

# cor_res_melt_obs<-fread('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v8_pred_exp/pred_obs_cor_YFS.txt')

cor_res_average<-NULL
for(i in unique(cor_res_melt_obs$test)){
  cor_res_average<-rbind(cor_res_average, data.frame(test=i,
                                                     Mean=mean(cor_res_melt_obs$value[cor_res_melt_obs$test == i])))
}

library(ggplot2)
png('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v8_pred_exp/pred_obs_cor_boxplot.png', units='px', width=1500, height=1000, res=300)
ggplot(cor_res_melt_obs, aes(x=paste0(Var1,' vs. ',Var2), y=value)) +
  geom_boxplot() +
  labs(x='Test', y='Correlation') +
  coord_flip()
dev.off()

# Make a pairs plot
library("ggplot2")
library("GGally") 

cor_res_melt_obs_unmelt<-dcast(data = cor_res_melt_obs,formula = L1~Var2,fun.aggregate = sum,value.var = "value")

cor_res_melt_obs_unmelt$negative

lowerfun <- function(data,mapping){
  ggplot(data = data, mapping = mapping)+
    geom_point() +
    geom_abline(intercept =0 , slope = 1, colour='red') +
    geom_vline(xintercept= 0, colour='blue') +
    geom_hline(yintercept= 0, colour='blue') +
    xlim(c(min(cor_res_melt_obs$value),max(cor_res_melt_obs$value)))+
    ylim(c(min(cor_res_melt_obs$value),max(cor_res_melt_obs$value)))
}  

diagfun <- function(data,mapping){
  ggplot(data = data, mapping = mapping)+
    geom_density() +
    geom_vline(xintercept= 0, colour='blue') +
    xlim(c(min(cor_res_melt_obs$value),max(cor_res_melt_obs$value)))
}  

png('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v8_pred_exp/pred_obs_cor_pairsplot.png', units='px', width=5000, height=5000, res=300)
ggpairs(cor_res_melt_obs_unmelt[,-1], lower = list(continuous = wrap(lowerfun)), diag = list(continuous = wrap(diagfun)))
dev.off()

# Count the number of genes with correlation > 0.1 for each method
n_valid<-NULL
for(i in unique(cor_res_melt_obs$Var2)){
  n_valid<-rbind(n_valid, data.frame(Method=i,
                                     N_valid=sum(cor_res_melt_obs$value[cor_res_melt_obs$Var2 == i] > 0.1, na.rm=T),
                                     median_cor=median(cor_res_melt_obs$value[cor_res_melt_obs$Var2 == i], na.rm=T)))
}
n_valid<-n_valid[order(-n_valid$N_valid),]
n_valid$Freq_valid<-n_valid$N_valid/length(unique(cor_res_melt_obs$L1))

# Also count the number of times each method performed best
cor_res_melt_obs_top<-cor_res_melt_obs[order(-cor_res_melt_obs$value),]
cor_res_melt_obs_top<-cor_res_melt_obs_top[!is.na(cor_res_melt_obs_top$value),]
cor_res_melt_obs_top<-cor_res_melt_obs_top[!duplicated(cor_res_melt_obs_top$L1),]

n_valid$N_top<-NA
n_valid$median_top<-NA
for(i in n_valid$Method){
  n_valid$N_top[n_valid$Method == i]<-sum(cor_res_melt_obs_top$test == paste0('obs_',i))
  n_valid$median_top[n_valid$Method == i]<-median(cor_res_melt_obs_top$value[cor_res_melt_obs_top$test == paste0('obs_',i)])
}

n_valid$Prop_top<-n_valid$N_top/sum(n_valid$N_top)

# Also count the number of times each sumstat method performed best
cor_res_melt_obs_ss<-cor_res_melt_obs[grepl('eQTL', cor_res_melt_obs$test),]
cor_res_melt_obs_ss<-cor_res_melt_obs_ss[!is.na(cor_res_melt_obs_ss$value),]
cor_res_melt_obs_top<-cor_res_melt_obs_ss[order(-cor_res_melt_obs_ss$value),]
cor_res_melt_obs_top<-cor_res_melt_obs_top[!duplicated(cor_res_melt_obs_top$L1),]

n_valid$N_top_ss<-NA
n_valid$median_top_ss<-NA
for(i in n_valid$Method){
  n_valid$N_top_ss[n_valid$Method == i]<-sum(cor_res_melt_obs_top$test == paste0('obs_',i))
  n_valid$median_top_ss[n_valid$Method == i]<-median(cor_res_melt_obs_top$value[cor_res_melt_obs_top$test == paste0('obs_',i)])
}

n_valid$Prop_top_ss<-n_valid$N_top_ss/sum(n_valid$N_top_ss)

# Also count the number of times each sumstat method performed best
cor_res_melt_obs_top1<-cor_res_melt_obs[grepl('top1', cor_res_melt_obs$test),]
cor_res_melt_obs_top1<-cor_res_melt_obs_top1[!is.na(cor_res_melt_obs_top1$value),]
cor_res_melt_obs_top<-cor_res_melt_obs_top1[order(-cor_res_melt_obs_top1$value),]
cor_res_melt_obs_top<-cor_res_melt_obs_top[!duplicated(cor_res_melt_obs_top$L1),]

n_valid$N_top_top1<-NA
n_valid$median_top_top1<-NA
for(i in n_valid$Method){
  n_valid$N_top_top1[n_valid$Method == i]<-sum(cor_res_melt_obs_top$test == paste0('obs_',i))
  n_valid$median_top_top1[n_valid$Method == i]<-median(cor_res_melt_obs_top$value[cor_res_melt_obs_top$test == paste0('obs_',i)])
}

n_valid$Prop_top_top1<-n_valid$N_top_top1/sum(n_valid$N_top_top1)

write.csv(n_valid, '/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v8_pred_exp/n_valid.csv', row.names=F)

# Read in reported R2 by weights
yfs.profile<-fread('/mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/SNP-weights/YFS.BLOOD.RNAARR/YFS.BLOOD.RNAARR.profile')

Show correlation between observed and predicted expression

Method N_valid median_cor Freq_valid N_top median_top Prop_top N_top_ss median_top_ss Prop_top_ss N_top_top1 median_top_top1 Prop_top_top1
fusion.lasso 882 0.0478431 0.2206655 353 0.0908498 0.0884047 0 NA 0.0000000 0 NA 0.0000000
fusion.enet 876 0.0483409 0.2191644 343 0.0862650 0.0859003 0 NA 0.0000000 0 NA 0.0000000
fusion.bslmm 874 0.0485076 0.2186640 317 0.0959522 0.0793889 0 NA 0.0000000 0 NA 0.0000000
eQTL.prscs 790 0.0473161 0.1976482 220 0.0901410 0.0550964 652 0.0861865 0.1632858 0 NA 0.0000000
fusion.top1 774 0.0414693 0.1936452 440 0.0713319 0.1101928 0 NA 0.0000000 1801 0.0582154 0.4527401
fusion.blup 762 0.0450951 0.1906430 394 0.0670281 0.0986727 0 NA 0.0000000 0 NA 0.0000000
eQTL.lassosum 698 0.0411294 0.1746310 384 0.0633383 0.0961683 931 0.0697682 0.2331580 0 NA 0.0000000
eQTL.dbslmm 655 0.0429859 0.1638729 291 0.0664697 0.0728775 501 0.0630978 0.1254696 0 NA 0.0000000
eQTL.sbayesr_robust 621 0.0442815 0.1553665 126 0.0675922 0.0315552 239 0.0668272 0.0598547 0 NA 0.0000000
eQTL.ldpred2 554 0.0384420 0.1386040 278 0.0557534 0.0696218 430 0.0555669 0.1076885 0 NA 0.0000000
eQTL.sbayesr 520 0.0361438 0.1300976 189 0.0540907 0.0473328 301 0.0512844 0.0753819 0 NA 0.0000000
eQTL.top1 487 0.0255226 0.1218414 658 0.0539891 0.1647884 939 0.0545328 0.2351615 2177 0.0429274 0.5472599

The correlations are very low compared the R2 reported in the FUSION profile. This is true when using fusion or eQTL based models. Is this due to poor generalisablity across YFS and GTEx?


Evaluate eQTLGen derived models

Show code

library(data.table)

# Read in the observed expression
obs<-fread('/users/k1806347/brc_scratch/Data/GTeX/v8/GTEx_Analysis_v8_eQTL_expression_matrices/Whole_Blood.v8.normalized_expression.bed.gz')

# Insert external_gene_name
obs<-obs[,-1:-3]
obs$gene_id<-gsub('\\..*','',obs$gene_id)

obs<-t(obs)
colnames(obs)<-obs[1,]
obs<-obs[-1,]
obs<-cbind(row.names(obs),obs)
colnames(obs)[1]<-'ID'
obs<-data.table(obs)
obs<-cbind(obs[,1],data.frame(lapply(obs[,-1], function(x) as.numeric(as.character(x)))))

# Read in covariates
covs<-fread('/users/k1806347/brc_scratch/Data/GTeX/v8/GTEx_Analysis_v8_eQTL_covariates/Whole_Blood.v8.covariates.txt')
covs<-t(covs)
colnames(covs)<-covs[1,]
covs<-covs[-1,]
covs<-cbind(row.names(covs),covs)
colnames(covs)[1]<-'ID'
covs<-data.table(covs)
covs<-cbind(covs[,1],data.frame(lapply(covs[,-1], function(x) as.numeric(as.character(x)))))

# Merge observed expression and covariates
obs<-obs[,!duplicated(names(obs)),with=F]
obs_covs<-merge(obs, covs, by='ID')

# Read in predicted expression
eqtl<-fread('/mnt/lustre/users/k1806347/Data/GTeX/v8/Zac/05072022/FeaturePredictions_eQTLGen.eQTL.txt.gz')
names(eqtl)<-gsub('.RDat','',gsub('eQTLGen.','', names(eqtl)))

# Identify genes available in fusion and eqtl based models
obs_genes<-names(obs)[-1]
eqtl_genes<-unique(gsub('\\..*','',gsub('eQTL.','',names(eqtl)[-1:-2])))
both_genes<-intersect(obs_genes, eqtl_genes)

# Residualise the covariates
obs_resid<-data.frame(ID=obs_covs$ID, stringsAsFactors=F)
for(i in both_genes){
  obs_resid[[i]]<-as.numeric(scale(resid(lm(as.formula(paste0(i,' ~ ', paste(names(covs)[-1], collapse=' + '))), data=obs_covs))))
}
obs_resid<-data.table(obs_resid)

# Calculate correlation between observed expression and predicted expression from each method
cor_res<-list()
for(i in both_genes){
  # Rename columns to make output consistent across genes and label the observed expression
  pred_exp_tmp<-eqtl[,grepl(paste0('FID$|IID$|\\.',i,'\\.'), names(eqtl)), with=F]
  names(pred_exp_tmp)<-gsub(paste0('.',i), '', names(pred_exp_tmp))
  
  obs_exp_tmp<-obs_resid[,c('ID',i), with=F]
  names(obs_exp_tmp)[names(obs_exp_tmp) == i]<-'obs'
  
  # merge predicted and observed expression
  both_exp_tmp<-merge(pred_exp_tmp, obs_exp_tmp, by.x='IID', by.y='ID')
  
  both_exp_tmp$obs<-as.numeric(both_exp_tmp$obs)

  # Calculate correlation
  cor_res[[i]]<-cor(both_exp_tmp[,-1:-2,with=F], use='p')
}

# melt and combine all the results
cor_res_melt<-melt(cor_res)
cor_res_melt_obs<-cor_res_melt[cor_res_melt$Var1 == 'obs',]
cor_res_melt_obs<-cor_res_melt_obs[cor_res_melt_obs$Var1 != cor_res_melt_obs$Var2,]
cor_res_melt_obs$test<-paste0(cor_res_melt_obs$Var1,'_',cor_res_melt_obs$Var2)

write.table(cor_res_melt_obs, '/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v8_pred_exp/pred_obs_cor.txt', row.names=F, quote=F)

# cor_res_melt_obs<-fread('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v8_pred_exp/pred_obs_cor.txt')

cor_res_average<-NULL
for(i in unique(cor_res_melt_obs$test)){
  cor_res_average<-rbind(cor_res_average, data.frame(test=i,
                                                     Mean=mean(cor_res_melt_obs$value[cor_res_melt_obs$test == i]),
                                                     N=sum(cor_res_melt_obs$test == i)))
}

library(ggplot2)
png('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v8_pred_exp/pred_obs_cor_boxplot_eQTLGen.png', units='px', width=1500, height=1000, res=300)
ggplot(cor_res_melt_obs, aes(x=paste0(Var1,' vs. ',Var2), y=value)) +
  geom_boxplot() +
  labs(x='Test', y='Correlation') +
  coord_flip()
dev.off()

# Make a pairs plot
library("ggplot2")
library("GGally") 

cor_res_melt_obs_unmelt<-dcast(data = cor_res_melt_obs,formula = L1~Var2,fun.aggregate = sum,value.var = "value")

# Set values that equal 0 to NA as these represent missing data
cor_res_melt_obs_unmelt[cor_res_melt_obs_unmelt == 0]<-NA

lowerfun <- function(data,mapping){
  ggplot(data = data, mapping = mapping)+
    geom_point() +
    geom_abline(intercept =0 , slope = 1, colour='red') +
    geom_vline(xintercept= 0, colour='blue') +
    geom_hline(yintercept= 0, colour='blue') +
    xlim(c(min(cor_res_melt_obs$value),max(cor_res_melt_obs$value)))+
    ylim(c(min(cor_res_melt_obs$value),max(cor_res_melt_obs$value)))
}  

diagfun <- function(data,mapping){
  ggplot(data = data, mapping = mapping)+
    geom_density() +
    geom_vline(xintercept= 0, colour='blue') +
    xlim(c(min(cor_res_melt_obs$value),max(cor_res_melt_obs$value)))
}  

png('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v8_pred_exp/pred_obs_cor_pairsplot_eQTLGen.png', units='px', width=3000, height=3000, res=300)
ggpairs(cor_res_melt_obs_unmelt[,-1], lower = list(continuous = wrap(lowerfun)), diag = list(continuous = wrap(diagfun)))
dev.off()

# Count the number of genes with correlation > 0.1 for each method
n_valid<-NULL
for(i in unique(cor_res_melt_obs$Var2)){
  n_valid<-rbind(n_valid, data.frame(Method=i,
                                     N_valid=sum(cor_res_melt_obs$value[cor_res_melt_obs$Var2 == i] > 0.1, na.rm=T),
                                     median_cor=median(cor_res_melt_obs$value[cor_res_melt_obs$Var2 == i], na.rm=T)))
}
n_valid<-n_valid[order(-n_valid$N_valid),]
n_valid$Freq_valid<-n_valid$N_valid/length(unique(cor_res_melt_obs$L1))

# Also count the number of times each method performed best
cor_res_melt_obs_top<-cor_res_melt_obs[order(-cor_res_melt_obs$value),]
cor_res_melt_obs_top<-cor_res_melt_obs_top[!is.na(cor_res_melt_obs_top$value),]
cor_res_melt_obs_top<-cor_res_melt_obs_top[!duplicated(cor_res_melt_obs_top$L1),]

n_valid$N_top<-NA
n_valid$median_top<-NA
for(i in n_valid$Method){
  n_valid$N_top[n_valid$Method == i]<-sum(cor_res_melt_obs_top$test == paste0('obs_',i))
  n_valid$median_top[n_valid$Method == i]<-median(cor_res_melt_obs_top$value[cor_res_melt_obs_top$test == paste0('obs_',i)])
}

n_valid$Prop_top<-n_valid$N_top/sum(n_valid$N_top)

write.csv(n_valid, '/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v8_pred_exp/n_valid_eQTLGen.csv', row.names=F)

Show correlation between observed and predicted expression

Method N_valid median_cor Freq_valid N_top Prop_top median_top
eQTL.sbayesr_robust 2418 0.0452008 0.1720017 1222 0.0869318 0.0795815
eQTL.prscs 2376 0.0435711 0.1690141 1793 0.1275521 0.0738549
eQTL.ldpred2 2256 0.0413694 0.1604780 1978 0.1407128 0.0668918
eQTL.lassosum 2238 0.0403013 0.1591976 2361 0.1679590 0.0576601
eQTL.dbslmm 1984 0.0406331 0.1411296 1888 0.1343103 0.0590594
eQTL.top1 1875 0.0321053 0.1333760 3271 0.2326955 0.0583043
eQTL.sbayesr 1372 0.0298598 0.0975957 1544 0.1098385 0.0521219

The predicted-observed correlation is looking better here, with 87% of genes achieving a correlation > 0.1. However, the order of the methods has changed a bit, with top1 and SBayesR (in robust mode) performing best. I am not going to read into the order of methods until we have run across all genes. Note. SuSiE does not converge due to LD reference mismatch leading to many genes not being imputed using SuSiE.


Comparison of FUSION YFS and eQTLGen models

# Read in pred_obs_cor for YFS and eQTLGen
eqtlgen<-fread('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v8_pred_exp/pred_obs_cor.txt')
yfs<-fread('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v8_pred_exp/pred_obs_cor_YFS.txt')

# Subset YFS models from fusion
yfs<-yfs[grepl('fusion',yfs$Var2),]

# Compare number of (valid) genes from YFS and eQTLGen
length(unique(eqtlgen$L1)) # 14058
length(unique(yfs$L1)) # 3997

length(unique(eqtlgen$L1[eqtlgen$value > 0.1])) # 3453
length(unique(yfs$L1[yfs$value > 0.1])) # 1081

# Subset genes available in both fusion YFS and eQTLGen
library(biomaRt)
ensembl = useEnsembl(biomart="ensembl", dataset="hsapiens_gene_ensembl", GRCh=37)
Genes<-getBM(attributes=c('ensembl_gene_id','external_gene_name'), mart = ensembl)
eqtlgen<-merge(eqtlgen, Genes, by.x='L1', by.y='ensembl_gene_id')
eqtlgen$L1<-eqtlgen$external_gene_name

eqtlgen_subset<-eqtlgen[eqtlgen$L1 %in% yfs$L1,]
yfs_subset<-yfs[yfs$L1 %in% eqtlgen$L1,]

# Compare R for best eQTLGen model to best FUSION YFS model
eqtlgen_subset<-eqtlgen_subset[order(-eqtlgen_subset$value),]
eqtlgen_subset<-eqtlgen_subset[!is.na(eqtlgen_subset$value),]
eqtlgen_subset<-eqtlgen_subset[!duplicated(eqtlgen_subset$L1),]
yfs_subset<-yfs_subset[order(-yfs_subset$value),]
yfs_subset<-yfs_subset[!is.na(yfs_subset$value),]
yfs_subset<-yfs_subset[!duplicated(yfs_subset$L1),]

both_best<-merge(eqtlgen_subset, yfs_subset, by='L1')
nrow(both_best) # 3843
median(both_best$value.x) # 0.0732
median(both_best$value.y) # 0.062

library(ggplot2)
library(cowplot)

png('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v8_pred_exp/pred_obs_cor_eQTLGen_YFS_comp.png', units='px', width=1500, height=1500, res=300)

ggplot(both_best, aes(x=value.x, y=value.y)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, colour='red') +
  geom_vline(xintercept= 0, colour='blue') +
  geom_hline(yintercept= 0, colour='blue') +
  coord_fixed() +
  theme_half_open() +
  background_grid() +
  labs(x="R (eQTLGen)", y="R (FUSION YFS)")

dev.off()

# Look at distribution of R for best model for genes not in YFS
eqtlgen_not_in_yfs<-eqtlgen[!(eqtlgen$L1 %in% yfs$L1),]
eqtlgen_not_in_yfs<-eqtlgen_not_in_yfs[!is.na(eqtlgen_not_in_yfs$value),]
eqtlgen_not_in_yfs<-eqtlgen_not_in_yfs[!duplicated(eqtlgen_not_in_yfs$L1),]

eqtlgen_not_in_yfs$YFS<-F
eqtlgen_subset$YFS<-T

eqtlgen_comp_plot<-rbind(eqtlgen_not_in_yfs, eqtlgen_subset)

png('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v8_pred_exp/present_in_YFS_hist.png', units='px', width=1500, height=1000, res=300)

ggplot(eqtlgen_comp_plot, aes(x=value, fill=YFS, colour=YFS)) +
  geom_histogram(position="identity", alpha=0.5) +
  theme_half_open() +
  background_grid() +
  labs(x='R', fill = 'In YFS', colour='In YFS')

dev.off()

Show eQTLGen and FUSION YFS comparison


Evaluation of FUSION pseudovalidation

Show code

# Prepare the eQTL summary statistics for each gene as if they were GWAS summary statistics
library(data.table)

# Read in eQTL sumstats
eqtl_sumstats<-fread('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen/eQTLGen_sumstats_exclGTeX.txt')

# Create list of genes
genes<-unique(eqtl_sumstats$GENE)

# For each gene
for(i in 1:length(genes)){
  # Subset sumstats for gene i
  eqtl_sumstats_i<-eqtl_sumstats[eqtl_sumstats$GENE == genes[i],]
  
  # Munge for FUSION
  eqtl_sumstats_i<-eqtl_sumstats_i[,c('SNP','A1','A2','Z','N'),with=F]
  eqtl_sumstats_i<-eqtl_sumstats_i[complete.cases(eqtl_sumstats_i),]
  eqtl_sumstats_i<-eqtl_sumstats_i[eqtl_sumstats_i$N >= 0.8*max(eqtl_sumstats_i$N),]

  # Write out results
  fwrite(eqtl_sumstats_i, paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/pseudo/per_gene_eqtl/',genes[i],'.gz'), quote=F, sep='\t')
}

Adapt FUSION.assoc_test.R to run TWAS for specific SNP-weight. We will force it to use each model. If this approach works well we can streamline this process substantially.

# Now run TWAS for each gene using all models.

library(data.table)

# Read in list of eQTLGen SNP-weights
pos<-fread('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen/eQTLGen.eQTL.pos')

# For each gene
for(i in 1:nrow(pos)){
  for(mod in c('top1','sbayesr','sbayesr_robust','dbslmm','prscs','susie')){
    if(!file.exists(paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/pseudo/pseudo_res/', pos$ID[i],'.',mod))){
      system(paste0('/users/k1806347/brc_scratch/Software/Rscript.sh /scratch/users/k1806347/Software/MyGit/eQTL_to_TWAS/FUSION.assoc_test.pseudo.R --sumstats /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/pseudo/per_gene_eqtl/',pos$ID[i],'.gz --extract_weight ',pos$ID[i],' --weights /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen/eQTLGen.eQTL.pos --weights_dir /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen --ref_ld_chr /mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/LDREF/1000G.EUR. --out /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/pseudo/pseudo_res/', pos$ID[i],'.',mod,' --chr ',pos$CHR[i],' --force_model ',mod))
    }
  }
}

# Read in the TWAS.Z results
pseudo_res<-NULL
for(i in 1:nrow(pos)){
  for(mod in c('top1','sbayesr','sbayesr_robust','dbslmm','prscs','susie')){
    res_i_mod<-fread(paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/pseudo/pseudo_res/', pos$ID[i],'.',mod))
    
    pseudo_res<-rbind(pseudo_res, data.frame(ID=pos$ID[i],
                                             Model=mod,
                                             TWAS.Z=res_i_mod$TWAS.Z))
  }
}

# Read in the correlation between predicted and observed expression in GTEx v8.
cor_res_melt_obs<-fread('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v8_pred_exp/pred_obs_cor.txt')

cor_res_melt_obs<-cor_res_melt_obs[,c('L1','Var2','value'),with=F]
names(cor_res_melt_obs)<-c('ID','Model','R')
cor_res_melt_obs$Model<-gsub('eQTL.','',cor_res_melt_obs$Model)

obs_pseudo<-merge(pseudo_res, cor_res_melt_obs, by=c('ID','Model'))

# Determine the best model based on TWAS.Z and observed cor, and rank methods by TWAS.Z and observed cor for each gene.
genes<-unique(obs_pseudo$ID)
obs_pseudo_rank<-NULL
rank_diff<-NULL
n_top<-NULL
for(i in 1:length(genes)){
  obs_pseudo_i<-obs_pseudo[obs_pseudo$ID == genes[i],]
  obs_pseudo_i$pseudo_rank<-rank(-obs_pseudo_i$TWAS.Z)
  obs_pseudo_i$obs_rank<-rank(-obs_pseudo_i$R)
  obs_pseudo_rank<-rbind(obs_pseudo_rank, obs_pseudo_i)
  
  rank_diff<-rbind(rank_diff, data.frame(ID=genes[i],
                                         pseudo_top_mod=obs_pseudo_i$Model[obs_pseudo_i$pseudo_rank == min(obs_pseudo_i$pseudo_rank)][1],
                                         pseudo_R = obs_pseudo_i$R[obs_pseudo_i$pseudo_rank == min(obs_pseudo_i$pseudo_rank)][1],
                                         obs_top_mod=obs_pseudo_i$Model[obs_pseudo_i$obs_rank == min(obs_pseudo_i$obs_rank)][1],
                                         top_R = obs_pseudo_i$R[obs_pseudo_i$obs_rank == min(obs_pseudo_i$obs_rank)][1]))
  
}

# Calculate correlation between model rank for each gene
cor(obs_pseudo_rank$pseudo_rank, obs_pseudo_rank$obs_rank) # 0.6105157

# Calculate the mean difference in R between the best model and the model selected by TWAS.Z
rank_diff$diff<-rank_diff$pseudo_R-rank_diff$top_R
mean(rank_diff$diff) # -0.006272514

# Calculate the percentage of genes where pseuodoval did not select the correct model
nrow(rank_diff[rank_diff$diff != 0,])/nrow(rank_diff) # 0.4155844

# Crosstab the selected observed and pseudoval-selected best model
saveRDS(table(rank_diff$pseudo_top_mod, rank_diff$obs_top_mod),'/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/pseudo/best_pseudo_comp.rds')
table(rank_diff$pseudo_top_mod, rank_diff$obs_top_mod)
table(rank_diff$pseudo_top_mod)
table(rank_diff$obs_top_mod)

rank_diff<-rank_diff[order(rank_diff$diff),]

library(ggplot2)
library(cowplot)

png('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/pseudo/best_pseudo_comp.png', res=300, units='px', width=1000, height=1000)
ggplot(rank_diff, aes(x=top_R, y=pseudo_R)) +
  geom_abline(intercept = 0, slope = 1, colour='red') +
  geom_point() +
  coord_fixed() +
  theme_half_open() +
  xlim(c(0,NA)) +
  ylim(c(0,NA)) +
  background_grid() +
  labs(x="R (Best)", y="R (Pseudovalidated)")
dev.off()


##########
# Identify genes with abnormally low correlation with observed when using SuSiE
##########

cor_res_melt_obs_susie<-cor_res_melt_obs[cor_res_melt_obs$Model == 'susie',]
# They have already been removed, probably due to zero standard deviation in predicted epxression. I am guessing there is a BETA of 0 for all variants in these cases.

genes[!(genes %in% cor_res_melt_obs_susie$ID)]

# Load the Rdat file for genes without SusiE predicted observed correlation

Comparison of model that performed best in GTEx to model that was selected by pseudovalidation

##                    top1_pseudo sbayesr_pseudo sbayesr_robust_pseudo
## top1_obs                    39              0                     7
## sbayesr_obs                  2              0                     0
## sbayesr_robust_obs           1              1                     2
## dbslmm_obs                   1              1                     0
## prscs_obs                    0              0                     0
## susie_obs                    1              0                     0
##                    dbslmm_pseudo prscs_pseudo susie_pseudo
## top1_obs                       6            2            4
## sbayesr_obs                    1            0            0
## sbayesr_robust_obs             1            0            1
## dbslmm_obs                     2            1            0
## prscs_obs                      1            0            0
## susie_obs                      0            1            2

These results are really encouraging. There is very little difference between the observed-predicted expression correlation for the best performing model and the model selected by pseudovalidation. There is some overfitting in the GTEx data as well since there is the best model is being selected in the GTEx sample. I am not sure there is any need to try testing MegaPRS pseudovalidation.


Streamline FUSION pseudovalidation


mkdir /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/pseudo2
for chr in $(seq 1 22);do
  sbatch -p brc,shared --mem 10G /users/k1806347/brc_scratch/Software/Rscript.sh /scratch/users/k1806347/Software/MyGit/eQTL_to_TWAS/FUSION.assoc_test.pseudo_V2.R \
    --sumstats /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen/eQTLGen_sumstats_exclGTeX.txt \
    --weights /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGenall/eQTLGen.eQTL.pos \
    --weights_dir /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGenall \
    --ref_ld_chr /mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/LDREF/1000G.EUR. \
    --out /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/pseudo2/pseudo_res_chr${chr} \
    --chr ${chr} \
    --all_mod T
done
library(data.table)

# Read in pseudoval results
pseudo_res_2<-NULL
for(i in 1:22){
  pseudo_res_2<-rbind(pseudo_res_2, fread(paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/pseudo2/pseudo_res_chr',i)))
}

names(pseudo_res_2)[names(pseudo_res_2) == 'MODEL']<-'Model'
pseudo_res_2<-pseudo_res_2[,c('ID','Model','TWAS.Z'), with=F]

pos<-fread('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGenall/eQTLGen.eQTL.pos')
pos2<-fread('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen/eQTLGen.eQTL.pos')

pseudo_res<-NULL
for(i in 1:nrow(pos2)){
  for(mod in c('top1','sbayesr','sbayesr_robust','dbslmm','prscs','susie')){
    res_i_mod<-fread(paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/pseudo/pseudo_res/', pos2$ID[i],'.',mod))
    
    pseudo_res<-rbind(pseudo_res, data.frame(ID=pos2$ID[i],
                                             Model=mod,
                                             TWAS.Z=res_i_mod$TWAS.Z))
  }
}

pseudo_comp<-merge(pseudo_res, pseudo_res_2, by=c('ID','Model'))
# New ad old pseudoval scripts correlate perfectly.

# Read in the correlation between predicted and observed expression in GTEx v8.
cor_res_melt_obs<-fread('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v8_pred_exp/pred_obs_cor.txt')

cor_res_melt_obs<-cor_res_melt_obs[,c('L1','Var2','value'),with=F]
names(cor_res_melt_obs)<-c('ID','Model','R')
cor_res_melt_obs$Model<-gsub('eQTL.','',cor_res_melt_obs$Model)

obs_pseudo<-merge(pseudo_res_2, cor_res_melt_obs, by=c('ID','Model'))

# Determine the best model based on TWAS.Z and observed cor, and rank methods by TWAS.Z and observed cor for each gene.
genes<-unique(obs_pseudo$ID)
obs_pseudo_rank<-NULL
rank_diff<-NULL
n_top<-NULL
for(i in 1:length(genes)){
  obs_pseudo_i<-obs_pseudo[obs_pseudo$ID == genes[i],]
  obs_pseudo_i$pseudo_rank<-rank(-obs_pseudo_i$TWAS.Z)
  obs_pseudo_i$obs_rank<-rank(-obs_pseudo_i$R)
  obs_pseudo_rank<-rbind(obs_pseudo_rank, obs_pseudo_i)
  
  rank_diff<-rbind(rank_diff, data.frame(ID=genes[i],
                                         pseudo_top_mod=obs_pseudo_i$Model[obs_pseudo_i$pseudo_rank == min(obs_pseudo_i$pseudo_rank)][1],
                                         pseudo_R = obs_pseudo_i$R[obs_pseudo_i$pseudo_rank == min(obs_pseudo_i$pseudo_rank)][1],
                                         obs_top_mod=obs_pseudo_i$Model[obs_pseudo_i$obs_rank == min(obs_pseudo_i$obs_rank)][1],
                                         top_R = obs_pseudo_i$R[obs_pseudo_i$obs_rank == min(obs_pseudo_i$obs_rank)][1]))
  
}

# Calculate correlation between model rank for each gene
cor(obs_pseudo_rank$pseudo_rank, obs_pseudo_rank$obs_rank) # 0.08380318

# Calculate the mean difference in R between the best model and the model selected by TWAS.Z
rank_diff$diff<-rank_diff$pseudo_R-rank_diff$top_R
mean(rank_diff$diff, na.rm=T) # -0.02506432

# Calculate the percentage of genes where pseuodoval did not select the correct model
nrow(rank_diff[rank_diff$diff != 0,])/nrow(rank_diff) # 0.8415651

# Crosstab the selected observed and pseudoval-selected best model
saveRDS(table(rank_diff$pseudo_top_mod, rank_diff$obs_top_mod),'/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/pseudo2/best_pseudo_comp_eQTLGen.rds')
table(rank_diff$pseudo_top_mod, rank_diff$obs_top_mod)
table(rank_diff$pseudo_top_mod)
table(rank_diff$obs_top_mod)

rank_diff<-rank_diff[order(rank_diff$diff),]

library(ggplot2)
library(cowplot)

png('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/pseudo2/best_pseudo_comp_eQTLGen.png', res=300, units='px', width=1000, height=1000)
ggplot(rank_diff, aes(x=top_R, y=pseudo_R)) +
  geom_abline(intercept = 0, slope = 1, colour='red') +
  geom_point() +
  coord_fixed() +
  theme_half_open() +
  background_grid() +
  labs(x="R (Best)", y="R (Pseudovalidated)")
dev.off()

png('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/pseudo2/best_pseudo_comp_eQTLGen_hist.png', res=300, units='px', width=1500, height=1000)
ggplot(rank_diff, aes(x=diff)) +
  geom_histogram() +
  theme_half_open() +
  background_grid() +
  labs(x="Difference in R")
dev.off()

Comparison of model that performed best in GTEx to model that was selected by pseudovalidation

## Loading required package: lattice
## Loading required package: ggplot2
## Warning: replacing previous import 'lifecycle::last_warnings' by
## 'rlang::last_warnings' when loading 'tibble'
## Confusion Matrix and Statistics
## 
##                 
##                  dbslmm lassosum prscs sbayesr sbayesr_robust ldpred2 top1
##   dbslmm            715      803   425     450            311     529 1054
##   lassosum         1011     1324  1231     935            809    1275 1906
##   prscs               3        8     3       3             10       6    7
##   sbayesr           113      183   101     133             75     111  214
##   sbayesr_robust     23       18    16       5              9      16   19
##   ldpred2            19       26    12      10              6      36   54
##   top1                1        4     2       3              0       1    3
## 
## Overall Statistics
##                                           
##                Accuracy : 0.1584          
##                  95% CI : (0.1524, 0.1646)
##     No Information Rate : 0.2321          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0061          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
## 
## Statistics by Class:
## 
##                      Class: dbslmm Class: lassosum Class: prscs Class: sbayesr
## Sensitivity                0.37931         0.55959    0.0016760       0.086420
## Specificity                0.70591         0.38560    0.9969774       0.936199
## Pos Pred Value             0.16678         0.15593    0.0750000       0.143011
## Neg Pred Value             0.87993         0.81191    0.8722750       0.892680
## Prevalence                 0.13435         0.16863    0.1275747       0.109686
## Detection Rate             0.05096         0.09436    0.0002138       0.009479
## Detection Prevalence       0.30554         0.60516    0.0028508       0.066282
## Balanced Accuracy          0.54261         0.47260    0.4993267       0.511309
##                      Class: sbayesr_robust Class: ldpred2 Class: top1
## Sensitivity                      0.0073770       0.018237   0.0009211
## Specificity                      0.9924284       0.989467   0.9989790
## Pos Pred Value                   0.0849057       0.220859   0.2142857
## Neg Pred Value                   0.9130341       0.860254   0.7678533
## Prevalence                       0.0869503       0.140688   0.2321289
## Detection Rate                   0.0006414       0.002566   0.0002138
## Detection Prevalence             0.0075547       0.011617   0.0009978
## Balanced Accuracy                0.4999027       0.503852   0.4999501

Evaluate pseudovalidation using all FUSION YFS models


mkdir /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/pseudo_fusion_yfs

for chr in $(seq 1 22);do
  sbatch -p brc,shared --exclude=nodeb11,nodeb16,nodeb09,nodeb18 --mem 10G /users/k1806347/brc_scratch/Software/Rscript.sh /scratch/users/k1806347/Software/MyGit/eQTL_to_TWAS/FUSION.assoc_test.pseudo_V2.R \
    --sumstats /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS/YFS_eQTL_from_fusion.txt \
    --weights /mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/SNP-weights/YFS.BLOOD.RNAARR/YFS.BLOOD.RNAARR.pos \
    --weights_dir /mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/SNP-weights/YFS.BLOOD.RNAARR \
    --ref_ld_chr /mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/LDREF/1000G.EUR. \
    --out /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/pseudo_fusion_yfs/pseudo_res_chr${chr} \
    --chr ${chr} \
    --all_mod T
done
library(data.table)

# Read in pseudoval results
pseudo_res_2<-NULL
for(i in 1:22){
  pseudo_res_2<-rbind(pseudo_res_2, fread(paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/pseudo_fusion_yfs/pseudo_res_chr',i)))
}

# Determine the best model based on TWAS.Z and observed cor, and rank methods by TWAS.Z and observed cor for each gene.
genes<-unique(pseudo_res_2$ID)
obs_pseudo_rank<-NULL
rank_diff<-NULL
n_top<-NULL
for(i in 1:length(genes)){
  obs_pseudo_i<-pseudo_res_2[pseudo_res_2$ID == genes[i],]
  obs_pseudo_i$pseudo_rank<-rank(-obs_pseudo_i$TWAS.Z)
  obs_pseudo_i$obs_rank<-rank(-obs_pseudo_i$MODELCV.R2)
  obs_pseudo_rank<-rbind(obs_pseudo_rank, obs_pseudo_i)
  
  rank_diff<-rbind(rank_diff, data.frame(ID=genes[i],
                                         pseudo_top_mod=obs_pseudo_i$MODEL[obs_pseudo_i$pseudo_rank == min(obs_pseudo_i$pseudo_rank)][1],
                                         pseudo_R = sqrt(obs_pseudo_i$MODELCV.R2[obs_pseudo_i$pseudo_rank == min(obs_pseudo_i$pseudo_rank)][1]),
                                         obs_top_mod=obs_pseudo_i$MODEL[obs_pseudo_i$obs_rank == min(obs_pseudo_i$obs_rank)][1],
                                         top_R = sqrt(obs_pseudo_i$MODELCV.R2[obs_pseudo_i$obs_rank == min(obs_pseudo_i$obs_rank)][1])))
  
}

# Calculate correlation between model rank for each gene
cor(obs_pseudo_rank$pseudo_rank, obs_pseudo_rank$obs_rank) # 0.0623

# Calculate the mean difference in R between the best model and the model selected by TWAS.Z
rank_diff$diff<-rank_diff$pseudo_R-rank_diff$top_R
mean(rank_diff$diff, na.rm=T) # -0.024

# Calculate the percentage of genes where pseuodoval did not select the correct model
nrow(rank_diff[rank_diff$diff != 0,])/nrow(rank_diff) # 0.827

# Crosstab the selected observed and pseudoval-selected best model
saveRDS(table(rank_diff$pseudo_top_mod, rank_diff$obs_top_mod),'/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/pseudo_fusion_yfs/best_pseudo_comp.rds')
table(rank_diff$pseudo_top_mod, rank_diff$obs_top_mod)
table(rank_diff$pseudo_top_mod)
table(rank_diff$obs_top_mod)

rank_diff<-rank_diff[order(rank_diff$diff),]

library(ggplot2)
library(cowplot)

png('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/pseudo_fusion_yfs/best_pseudo_comp.png', res=300, units='px', width=1000, height=1000)
ggplot(rank_diff, aes(x=top_R, y=pseudo_R)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, colour='red') +
  coord_fixed() +
  theme_half_open() +
  background_grid() +
  labs(x="R (Best)", y="R (Pseudovalidated)")
dev.off()

png('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/pseudo_fusion_yfs/best_pseudo_comp_hist.png', res=300, units='px', width=1500, height=1000)
ggplot(rank_diff, aes(x=diff)) +
  geom_histogram() +
  theme_half_open() +
  background_grid() +
  labs(x="Difference in R")
dev.off()

Comparison of model that performed best in YFS to model that was selected by pseudovalidation

## Confusion Matrix and Statistics
## 
##        
##         enet bslmm blup lasso
##   enet   186   155  103    65
##   bslmm  130   223  206    94
##   blup   590  1184  215   803
##   lasso    1     1    0     2
## 
## Overall Statistics
##                                           
##                Accuracy : 0.1582          
##                  95% CI : (0.1469, 0.1699)
##     No Information Rate : 0.3949          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : -0.0371         
##                                           
##  Mcnemar's Test P-Value : <2e-16          
## 
## Statistics by Class:
## 
##                      Class: enet Class: bslmm Class: blup Class: lasso
## Sensitivity              0.20507      0.14267     0.41031    0.0020747
## Specificity              0.89413      0.82046     0.24956    0.9993320
## Pos Pred Value           0.36542      0.34150     0.07701    0.5000000
## Neg Pred Value           0.79095      0.59455     0.73499    0.7567021
## Prevalence               0.22916      0.39490     0.13239    0.2435574
## Detection Rate           0.04699      0.05634     0.05432    0.0005053
## Detection Prevalence     0.12860      0.16498     0.70541    0.0010106
## Balanced Accuracy        0.54960      0.48157     0.32993    0.5007033

The variance explained by the best model identified by pseudovalidation is very similar to the best model identified by cross validation, however the pseudoval model is often not the model identified by cross validation.